Similarity Search - The Metric Space Approach
Pavel Zezula Springer ISBN: 0-387-29146-6 |
In the Information Society, information holds the master key to economic influence. Similarity Search-The Metric Space Approach focuses on efficient ways to locate user-relevant information in collections of objects, the similarity of which is quantified using a pairwise distance measure. This book is a direct response to recent advances in computing, communications and storage which have led to the current flood of digital libraries, data warehouses and the limitless heterogeneity of internet resources.
Similarity Search-The Metric Space Approach introduces state-of-the-art in developing index structures for searching complex data modeled as instances of a metric space.
This book consists of two parts:
Part I presents the metric search approach in a nutshell by defining the problem, describes major theoretical principles, and provides an extensive survey of specific techniques for a large range of applications.
Part II concentrates on approaches particularly designed for searching in large collections of data. After describing the most popular centralized disk-based metric indexes, we present approximation techniques are presented as a way to significantly speed up search time at the expense of some imprecision in query results, and finally we discuss parallel and distributed disk-based metric indexes.
Similarity Search-The Metric Space Approach is designed for a professional audience, composed of academic researchers as well as practitioners in industry. This book is also suitable as introductory material for graduate-level students in computer science.
Part I: Metric searching in a Nutshell
Similarity searching has become a fundamental computational task in a variety of application areas, including multimedia information retrieval, data mining, pattern recognition, machine learning, computer vision, biomedical databases, data compression and statistical data analysis. In such environments, an exact match has little meaning, and proximity/distance concepts (similarity/dissimilarity) are typically much more fruitful for searching. In this part of the book we explain how the metric space approach can be effectively and efficiently used to deal with this problem.
Part I contains the following two chapters:
Chapter 1: Foundation of metric spaces
In this chapter we introduce the problem of metric searching and justify its importance with respect to other approaches. After defining a metric space we show examples of several distance measures which are used for searching in diverse data collections. We survey the paradigms used to express queries and we discuss the basic partitioning principles, which are at the basis of query execution. The reminder of this chapter is devoted to performance related issues. Specifically, we discuss techniques aimed at reducing the number of distance computations, we present useful metric space transformations, and we explain concepts of approximate similarity search. Finally, we provide a collection of analytic tools and approaches especially developed for metric index structures. Go back
Chapter 2: Survey of existing approaches
In this chapter, we give an overview of existing indexes for metric spaces. In the interests of a systematic presentation, we have divided the individual techniques into four groups. In addition we also present some techniques for approximate similarity search. Specifically, we first explain techniques which make use of ball partitioning, then we describe indexing approaches based on generalized hyperplane partitioning. A significant group of indexing methods that we also report are those that compute distances to characteristic objects and then use these results to organize the data. In order to maximize performance, many approaches combine several of the basic principles into a single index. In this chapter we also discuss the most important of these hybrid approaches. Finally, we treat the important topic of approximate similarity search, which trades some precision in search results for significant improvements in performance. Go back
Part II: Metric Searching in Large Collections of Data
Database scalability is a topic which has been well-explored and much-debated, but there are still no easy answers. There are several ways to achieve higher scalability of a database, but which of them to choose depends greatly upon the unique needs of individual users. Creating a search index structure which scales to very large dimensions presents many challenges, and the task is becoming increasingly difficult as the amount of data grows. In this part of the book, we assume that an index for processing large data stores and accesses indexed features on a secondary memory, that is on a disk. Go back
Part II contains the following three chapters:
Chapter 3: Centralized index structures
Most existing search structures have been designed to run on a single computer. They are built with different assumptions about type of distance function (discrete vs. continuous), form of query (range, nearest neighbor, etc.), and temporal properties (static or dynamic) of the data to be organized. Although many index structures have been proposed as main memory structures, there are several indexes which organize data using disk storage to allow the processing of a large volume of data. We focus on two basic approaches which store objects in secondary memories. Specifically, we discuss tree-based structures and methods which employ hashing (i.e., the key-to-address transformation) principles. Go back
Chapter 4: Approximate similarity search
Similarity search in metric spaces is generally expensive and state-of-the art access methods still do not provide an acceptable response time for highly interactive applications. Fortunately, in many applications it is sufficient to perform an approximate similarity search where an inaccurate result-set is obtained. The attractiveness of this approach is emphasized by the fact that the approximate search is typically performed much faster. In this chapter, we present some of the most relevant approximate similarity search strategies. Finally, the pros and cons of approximate similarity search are treated, and evidence provided by tests conducted on real datasets is discussed. Go back
Chapter 5: Parallel and distributed indexes
Centralized metric indexes achieve a significant speedup when compared to the sequential scan. However, experience with centralized methods reveals that query execution cost increases linearly with the growth of the dataset. In this chapter, we present methods which solve this problem by exploiting parallel computing power. The idea behind is easy in principle: as the dataset grows in size, more independent computation and storage resources are added (CPUs, disks, etc.), keeping the query response time low. Go back