Wikipedia

Similarity search

Similarity search is the general term for a range of mechanisms that search (typically very large) spaces of objects in which the only available comparator is the similarity between any pair of objects. It is becoming increasingly important as large information repositories accumulate objects that have no natural order, for example large collections of images, sounds and other complex digital objects.

Nearest neighbor search and range queries are important subclasses of similarity search, and a number of solutions exist. Research in similarity search is dominated by the inherent problems of searching over complex objects. Such objects cause most known techniques to lose traction over large collections, a manifestation of the so-called curse of dimensionality, and many problems remain unsolved. Unfortunately, in many cases where similarity search is needed, the objects are inherently complex.
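As a point of reference for the two query types named above, the following is a minimal sketch of both as exhaustive linear scans. It assumes only that some distance function is available; the names `range_query`, `knn_query` and `abs_diff` are illustrative and do not come from any particular library. Index structures aim to return the same answers while examining far fewer objects.

```python
import heapq
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
Distance = Callable[[T, T], float]


def range_query(data: Sequence[T], query: T, radius: float, dist: Distance) -> List[T]:
    """Return every object whose distance to the query is at most `radius`."""
    return [x for x in data if dist(query, x) <= radius]


def knn_query(data: Sequence[T], query: T, k: int, dist: Distance) -> List[T]:
    """Return the k objects closest to the query, nearest first."""
    return heapq.nsmallest(k, data, key=lambda x: dist(query, x))


def abs_diff(a: float, b: float) -> float:
    """Toy distance on real numbers (hypothetical example data)."""
    return abs(a - b)


points = [1.0, 4.0, 6.5, 9.0, 15.0]
print(range_query(points, 5.0, 2.0, abs_diff))  # [4.0, 6.5]
print(knn_query(points, 5.0, 2, abs_diff))      # [4.0, 6.5]
```

Both functions compute one distance per stored object, which is the cost that similarity indexes try to avoid.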

The most general approach to similarity search relies upon the mathematical notion of metric space, which allows the construction of efficient index structures in order to achieve scalability in the search domain.

Similarity search evolved independently in a number of different scientific and computing contexts, according to various needs. In 2008 a few leading researchers in the field felt strongly that the subject should be a research topic in its own right, to allow focus on the general issues applicable across the many diverse domains of its use. This resulted in the formation of the SISAP foundation, whose main activity is a series of annual international conferences on the generic topic.

Metric search

Metric search is similarity search that takes place within metric spaces. While the semimetric properties (non-negativity, identity and symmetry) are more or less necessary for any kind of search to be meaningful, the further property of the triangle inequality is useful for engineering, rather than conceptual, purposes.
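For reference, the properties mentioned above can be written out as follows. The symbols are assumptions chosen for illustration: d is the distance function, and x, y, z range over all objects in the space. The first three lines are the semimetric properties; the fourth is the triangle inequality that turns a semimetric into a metric.

```latex
\begin{align*}
d(x, y) &\ge 0                   && \text{(non-negativity)} \\
d(x, y) &= 0 \iff x = y          && \text{(identity)} \\
d(x, y) &= d(y, x)               && \text{(symmetry)} \\
d(x, z) &\le d(x, y) + d(y, z)   && \text{(triangle inequality)}
\end{align*}
```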

A simple corollary of the triangle inequality is that, if any two objects within the space are far apart, then no third object can be close to both. This observation allows data structures to be built, based on distances measured within the data collection, which allow subsets of the data to be excluded when a query is executed. As a simple example, a reference object can be chosen from the data set, and the remainder of the set divided into two parts based on distance to this object: those close to the reference object in set A, and those far from it in set B. If, when the set is later queried, the distance from the query to the reference object is large, then none of the objects within set A can be very close to the query; if it is very small, then no object within set B can be close to the query.
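The reference-object example can be made concrete with a small sketch. For a query q, reference object p and stored object x, the triangle inequality gives |d(q, p) - d(p, x)| <= d(q, x), so one distance computation to p yields a lower bound on the distance to every object in a set. The class below is a hypothetical single-level illustration, not a named index structure: it assumes the first object is used as the reference and that the median distance is used as the split threshold, both arbitrary choices.

```python
from statistics import median
from typing import Callable, List, Sequence, Tuple


class PivotPartition:
    """Splits a collection into a near set (A) and a far set (B) around one reference object."""

    def __init__(self, data: Sequence[float], dist: Callable[[float, float], float]):
        if not data:
            raise ValueError("data must not be empty")
        self.dist = dist
        self.pivot = data[0]  # assumption: the first object serves as the reference
        rest = list(data[1:])
        dists = [dist(self.pivot, x) for x in rest]
        # assumption: split at the median distance so A and B are roughly balanced
        self.threshold = median(dists) if dists else 0.0
        self.near = [x for x, d in zip(rest, dists) if d <= self.threshold]  # set A
        self.far = [x for x, d in zip(rest, dists) if d > self.threshold]    # set B

    def range_query(self, query: float, radius: float) -> Tuple[List[float], int]:
        """Return (matches, number of non-pivot objects actually compared)."""
        d_qp = self.dist(query, self.pivot)
        results = [self.pivot] if d_qp <= radius else []
        candidates: List[float] = []
        # Every x in A has d(p, x) <= t, hence d(q, x) >= d(q, p) - t.
        # If that bound exceeds the radius, A cannot contain a match.
        if d_qp - self.threshold <= radius:
            candidates += self.near
        # Every x in B has d(p, x) > t, hence d(q, x) > t - d(q, p).
        # If that bound is at least the radius, B cannot contain a match.
        if self.threshold - d_qp < radius:
            candidates += self.far
        results += [x for x in candidates if self.dist(query, x) <= radius]
        return results, len(candidates)


def abs_diff(a: float, b: float) -> float:
    return abs(a - b)


index = PivotPartition([0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0], abs_diff)
print(index.range_query(11.5, 1.0))  # ([11.0, 12.0], 3): set A is skipped
print(index.range_query(1.5, 1.0))   # ([1.0, 2.0], 3): set B is skipped
print(index.range_query(6.5, 1.0))   # ([], 6): the query is near the boundary, nothing is excluded
```

The last query shows the limit of the technique: when the query lies close to the split threshold, neither set can be excluded and the search degrades to a full scan.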

Once such situations are quantified and studied, many different metric indexing structures can be designed, variously suitable for different types of collections. The research domain of metric search can thus be characterised as the study of pre-processing algorithms over large and relatively static collections of data which, using the properties of metric spaces, allow efficient similarity search to be performed.


Types

Locality-sensitive hashing

A popular approach to similarity search is locality-sensitive hashing (LSH).[1] It hashes input items so that similar items map to the same "buckets" in memory with high probability (the number of buckets being much smaller than the universe of possible input items). It is often applied in nearest neighbor search on large-scale, high-dimensional data, e.g., image databases, document collections, time-series databases, and genome databases.[2]
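The bucketing idea can be illustrated with one particular LSH family, random hyperplanes for cosine similarity. This is a sketch under stated assumptions rather than the scheme of any cited reference: the class name `HyperplaneLSH`, the parameters `num_bits` and `seed`, and the example vectors are all hypothetical. Each bit of a signature records which side of a random hyperplane a vector falls on, so vectors separated by a small angle agree on most bits and are likely to share a bucket.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


class HyperplaneLSH:
    """Single-table LSH using random hyperplanes (sign of a random projection per bit)."""

    def __init__(self, dim: int, num_bits: int = 8, seed: int = 0):
        rng = random.Random(seed)  # seeded so that the example is reproducible
        # One random normal vector per signature bit.
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]
        self.buckets: Dict[Tuple[int, ...], List[str]] = defaultdict(list)

    def signature(self, vec: Sequence[float]) -> Tuple[int, ...]:
        """Bit i is 1 when the vector lies on the non-negative side of hyperplane i."""
        return tuple(
            1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
            for plane in self.planes
        )

    def add(self, key: str, vec: Sequence[float]) -> None:
        self.buckets[self.signature(vec)].append(key)

    def candidates(self, vec: Sequence[float]) -> List[str]:
        """Keys stored in the query's bucket; these still need exact verification."""
        return list(self.buckets.get(self.signature(vec), []))


lsh = HyperplaneLSH(dim=3, num_bits=8, seed=42)
a = [1.0, 2.0, 3.0]
lsh.add("a", a)
lsh.add("a_scaled", [2.0 * v for v in a])  # same direction as a, so same signature
lsh.add("a_negated", [-v for v in a])      # opposite direction, so every bit flips

print(lsh.candidates(a))
```

A practical system would typically use several hash tables and verify the returned candidates with the true distance, since a single table can miss similar items that happen to fall on opposite sides of one hyperplane.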

See also

  • SISAP foundation and conference series: www.sisap.org
  • Similarity learning
  • Latent semantic analysis

Bibliography

  • Pei Lee, Laks V. S. Lakshmanan, Jeffrey Xu Yu: On Top-k Structural Similarity Search. ICDE 2012:774-785
  • Zezula, P., Amato, G., Dohnal, V., and Batko, M. Similarity Search - The Metric Space Approach. Springer, 2006. ISBN 0-387-29146-6
  • Samet, H. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, 2006. ISBN 0-12-369446-9
  • E. Chavez, G. Navarro, R.A. Baeza-Yates, J.L. Marroquin, Searching in metric spaces, ACM Computing Surveys, 2001
  • M.L. Hetland, The Basic Principles of Metric Indexing, Swarm Intelligence for Multi-objective Problems in Data Mining, Studies in Computational Intelligence Volume 242, 2009, pp 199–232

Resources

  • The Multi-Feature Indexing Network (MUFIN) Project
  • MI-File (Metric Inverted File)
  • Content-based Photo Image Retrieval Test-Collection (CoPhIR)

Benchmarks

  • ANN-Benchmarks, a benchmark for approximate nearest neighbor search algorithms; by Spotify

References

  1. ^ Gionis, A.; Indyk, P.; Motwani, R. (1999). "Similarity search in high dimensions via hashing". VLDB.
  2. ^ Rajaraman, A.; Ullman, J. (2010). "Mining of Massive Datasets, Ch. 3".
