Index-Based, High-Dimensional, Cosine Threshold Querying with Optimality Guarantees

Authors Yuliang Li, Jianguo Wang, Benjamin Pullman, Nuno Bandeira, Yannis Papakonstantinou



PDF
Thumbnail PDF

File

LIPIcs.ICDT.2019.11.pdf
  • Filesize: 1.03 MB
  • 20 pages

Document Identifiers

Author Details

Yuliang Li
  • Megagon Labs, Mountain View, California, USA
  • UC San Diego, San Diego, California, USA
Jianguo Wang
  • UC San Diego, San Diego, California, USA
Benjamin Pullman
  • UC San Diego, San Diego, California, USA
Nuno Bandeira
  • UC San Diego, San Diego, California, USA
Yannis Papakonstantinou
  • UC San Diego, San Diego, California, USA

Acknowledgements

We are very grateful to Victor Vianu who helped us significantly improve the presentation of the paper. We also thank the anonymous reviewers for the very constructive and helpful comments. This work was supported in part by the National Science Foundation (NSF) under awards BIGDATA 1447943 and ABI 1759980, and by the National Institutes of Health (NIH) under awards P41GM103484 and R24GM127667.

Cite AsGet BibTex

Yuliang Li, Jianguo Wang, Benjamin Pullman, Nuno Bandeira, and Yannis Papakonstantinou. Index-Based, High-Dimensional, Cosine Threshold Querying with Optimality Guarantees. In 22nd International Conference on Database Theory (ICDT 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 127, pp. 11:1-11:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/LIPIcs.ICDT.2019.11

Abstract

Given a database of vectors, a cosine threshold query returns all vectors in the database having cosine similarity to a query vector above a given threshold. These queries arise naturally in many applications, such as document retrieval, image search, and mass spectrometry. The present paper considers the efficient evaluation of such queries, providing novel optimality guarantees and exhibiting good performance on real datasets. We take as a starting point Fagin’s well-known Threshold Algorithm (TA), which can be used to answer cosine threshold queries as follows: an inverted index is first built from the database vectors during pre-processing; at query time, the algorithm traverses the index partially to gather a set of candidate vectors to be later verified against the similarity threshold. However, directly applying TA in its raw form misses significant optimization opportunities. Indeed, we first show that one can take advantage of the fact that the vectors can be assumed to be normalized, to obtain an improved, tight stopping condition for index traversal and to efficiently compute it incrementally. Then we show that one can take advantage of data skewness to obtain better traversal strategies. In particular, we show a novel traversal strategy that exploits a common data skewness condition which holds in multiple domains including mass spectrometry, documents, and image databases. We show that under the skewness assumption, the new traversal strategy has a strong, near-optimal performance guarantee. The techniques developed in the paper are quite general since they can be applied to a large class of similarity functions beyond cosine.

Subject Classification

ACM Subject Classification
  • Theory of computation → Data structures and algorithms for data management
  • Theory of computation → Database query processing and optimization (theory)
  • Information systems → Nearest-neighbor search
Keywords
  • Vector databases
  • Similarity search
  • Cosine
  • Threshold Algorithm

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Ruedi Aebersold and Matthias Mann. Mass-spectrometric exploration of proteome structure and function. Nature, 537:347-355, 2016. Google Scholar
  2. Thomas Dybdahl Ahle, Rasmus Pagh, Ilya Razenshteyn, and Francesco Silvestri. On the Complexity of Inner Product Similarity Join. In PODS, pages 151-164, 2016. Google Scholar
  3. David C. Anastasiu and George Karypis. L2AP: Fast cosine similarity search with prefix L-2 norm bounds. In ICDE, pages 784-795, 2014. Google Scholar
  4. David C. Anastasiu and George Karypis. PL2AP: Fast parallel cosine similarity search. In IA3, pages 8:1-8:8, 2015. Google Scholar
  5. Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, and Ludwig Schmidt. Practical and Optimal LSH for Angular Distance. In NIPS, pages 1225-1233, 2015. Google Scholar
  6. Holger Bast, Debapriyo Majumdar, Ralf Schenkel, Martin Theobald, and Gerhard Weikum. IO-Top-k: Index-access Optimized Top-k Query Processing. In VLDB, pages 475-486, 2006. Google Scholar
  7. Roberto J. Bayardo, Yiming Ma, and Ramakrishnan Srikant. Scaling Up All Pairs Similarity Search. In WWW, pages 131-140, 2007. Google Scholar
  8. Alina Beygelzimer, Sham Kakade, and John Langford. Cover Trees for Nearest Neighbor. In ICML, pages 97-104, 2006. Google Scholar
  9. Christian Böhm, Stefan Berchtold, and Daniel A. Keim. Searching in High-dimensional Spaces: Index Structures for Improving the Performance of Multimedia Databases. CSUR, 33(3):322-373, 2001. Google Scholar
  10. Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004. Google Scholar
  11. Andrei Z. Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Zien. Efficient Query Evaluation Using a Two-level Retrieval Process. In CIKM, pages 426-434, 2003. Google Scholar
  12. Nicolas Bruno, Luis Gravano, and Amélie Marian. Evaluating top-k queries over Web-accessible databases. In ICDE, pages 369-380, 2002. Google Scholar
  13. Lu Chen, Yunjun Gao, Baihua Zheng, Christian S. Jensen, Hanyu Yang, and Keyu Yang. Pivot-based Metric Indexing. PVLDB, 10(10):1058-1069, 2017. Google Scholar
  14. Mark De Berg, Otfried Cheong, Marc Van Kreveld, and Mark Overmars. Computational Geometry: Introduction. Springer, 2008. Google Scholar
  15. Prasad M Deshpande, Deepak P, and Krishna Kummamuru. Efficient Online top-K Retrieval with Arbitrary Similarity Measures. In EDBT, pages 356-367, 2008. Google Scholar
  16. Ronald Fagin, Amnon Lotem, and Moni Naor. Optimal Aggregation Algorithms for Middleware. In PODS, pages 102-113, 2001. Google Scholar
  17. Ronald Fagin, Amnon Lotem, and Moni Naor. Optimal aggregation algorithms for middleware. JCSS, 66(4):614-656, 2003. Google Scholar
  18. Ulrich Güntzer, Wolf-Tilo Balke, and Werner Kiebling. Optimizing Multi-Feature Queries for Image Databases. In VLDB, pages 419-428, 2000. Google Scholar
  19. Vagelis Hristidis, Nick Koudas, and Yannis Papakonstantinou. PREFER: A system for the efficient execution of multi-parametric ranked queries. In SIGMOD, pages 259-270, 2001. Google Scholar
  20. Xiao Hu, Yufei Tao, and Ke Yi. Output-optimal Parallel Algorithms for Similarity Joins. In PODS, pages 79-90, 2017. Google Scholar
  21. Ihab F. Ilyas, George Beskales, and Mohamed A. Soliman. A Survey of Top-k Query Processing Techniques in Relational Database Systems. CSUR, 40(4):1-58, 2008. Google Scholar
  22. Piotr Indyk and Rajeev Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In ICDT, pages 604-613, 1998. Google Scholar
  23. Harold W Kuhn and Albert W Tucker. Nonlinear programming. In Traces and Emergence of Nonlinear Programming, pages 247-258. Springer, 2014. Google Scholar
  24. B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable image search. In ICCV, pages 2130-2137, 2009. Google Scholar
  25. Henry Lam, Eric W. Deutsch, James S. Eddes, Jimmy K. Eng, Nichole King, Stephen E. Stein, and Ruedi Aebersold. Development and validation of a spectral library searching method for peptide identification from MS/MS. Proteomics, 7(5), 2007. Google Scholar
  26. Hui Li, Tsz Nam Chan, Man Lung Yiu, and Nikos Mamoulis. FEXIPRO: Fast and Exact Inner Product Retrieval in Recommender Systems. In SIGMOD, pages 835-850, 2017. Google Scholar
  27. Anand Rajaraman and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, 2011. Google Scholar
  28. Sharadh Ramaswamy and Kenneth Rose. Adaptive Cluster Distance Bounding for High-Dimensional Indexing. TKDE, 23(6):815-830, 2011. Google Scholar
  29. Hanan Samet. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann Publishers Inc., 2005. Google Scholar
  30. Yufei Tao, Ke Yi, Cheng Sheng, and Panos Kalnis. Quality and Efficiency in High Dimensional Nearest Neighbor Search. In SIGMOD, pages 563-576, 2009. Google Scholar
  31. Christina Teflioudi and Rainer Gemulla. Exact and Approximate Maximum Inner Product Search with LEMP. TODS, 42(1):5:1-5:49, 2016. Google Scholar
  32. Christina Teflioudi, Rainer Gemulla, and Olga Mykytiuk. LEMP: Fast Retrieval of Large Entries in a Matrix Product. In SIGMOD, pages 107-122, 2015. Google Scholar
  33. Mingxun Wang and Nuno Bandeira. Spectral Library Generating Function for Assessing Spectrum-Spectrum Match Significance. Journal of Proteome Research, 12(9):3944-3951, 2013. Google Scholar
  34. Roger Weber, Hans-Jörg Schek, and Stephen Blott. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. In VLDB, pages 194-205, 1998. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail