Improved Space-Efficient Approximate Nearest Neighbor Search Using Function Inversion

Author Samuel McCauley



PDF
Thumbnail PDF

File

LIPIcs.ESA.2024.88.pdf
  • Filesize: 0.79 MB
  • 19 pages

Document Identifiers

Author Details

Samuel McCauley
  • Williams College Computer Science, Williamstown, MA, USA

Acknowledgements

I would like to thank Rasmus Pagh and Martin Aumüller for helpful discussions, and the anonymous reviewers for their comments and suggestions.

Cite AsGet BibTex

Samuel McCauley. Improved Space-Efficient Approximate Nearest Neighbor Search Using Function Inversion. In 32nd Annual European Symposium on Algorithms (ESA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 308, pp. 88:1-88:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.ESA.2024.88

Abstract

Approximate nearest neighbor search (ANN) data structures have widespread applications in machine learning, computational biology, and text processing. The goal of ANN is to preprocess a set S so that, given a query q, we can find a point y whose distance from q approximates the smallest distance from q to any point in S. For most distance functions, the best-known ANN bounds for high-dimensional point sets are obtained using techniques based on locality-sensitive hashing (LSH). Unfortunately, space efficiency is a major challenge for LSH-based data structures. Classic LSH techniques require a very large amount of space, oftentimes polynomial in |S|. A long line of work has developed intricate techniques to reduce this space usage, but these techniques suffer from downsides: they must be hand tailored to each specific LSH, are often complicated, and their space reduction comes at the cost of significantly increased query times. In this paper we explore a new way to improve the space efficiency of LSH using function inversion techniques, originally developed in (Fiat and Naor 2000). We begin by describing how function inversion can be used to improve LSH data structures. This gives a fairly simple, black box method to reduce LSH space usage. Then, we give a data structure that leverages function inversion to improve the query time of the best known near-linear space data structure for approximate nearest neighbor search under Euclidean distance: the ALRW data structure of (Andoni, Laarhoven, Razenshteyn, and Waingarten 2017). ALRW was previously shown to be optimal among "list-of-points" data structures for both Euclidean and Manhattan ANN; thus, in addition to giving improved bounds, our results imply that list-of-points data structures are not optimal for Euclidean or Manhattan ANN .

Subject Classification

ACM Subject Classification
  • Theory of computation → Sorting and searching
  • Theory of computation → Data compression
Keywords
  • similarity search
  • locality-sensitive hashing
  • randomized algorithms
  • data structures
  • space efficiency
  • function inversion

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Thomas D Ahle, Martin Aumüller, and Rasmus Pagh. Parameter-free locality sensitive hashing for spherical range reporting. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 239-256. SIAM, 2017. Google Scholar
  2. Thomas Dybdahl Ahle. Optimal las vegas locality sensitive data structures. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 938-949. IEEE, 2017. Google Scholar
  3. Thomas Dybdahl Ahle, Rasmus Pagh, Ilya Razenshteyn, and Francesco Silvestri. On the complexity of inner product similarity join. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 151-164, 2016. Google Scholar
  4. Josh Alman and Ryan Williams. Probabilistic polynomials and hamming nearest neighbors. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 136-150. IEEE, 2015. Google Scholar
  5. Alexandr Andoni. Nearest neighbor search: the old, the new, and the impossible. PhD thesis, Massachusetts Institute of Technology, 2009. Google Scholar
  6. Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In 47th Annual Symposium on Foundations of Computer Science (FOCS), pages 459-468. IEEE, 2006. Google Scholar
  7. Alexandr Andoni and Piotr Indyk. Nearest neighbors in high-dimensional spaces. In Handbook of Discrete and Computational Geometry, pages 1135-1155. Chapman and Hall/CRC, 2017. Google Scholar
  8. Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, and Ludwig Schmidt. Practical and optimal lsh for angular distance. Advances in neural information processing systems, 28, 2015. Google Scholar
  9. Alexandr Andoni, Piotr Indyk, Huy L Nguyen, and Ilya Razenshteyn. Beyond locality-sensitive hashing. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 1018-1028. SIAM, 2014. Google Scholar
  10. Alexandr Andoni, Thijs Laarhoven, Ilya Razenshteyn, and Erik Waingarten. Lower bounds on time-space trade-offs for approximate near neighbors. arXiv preprint arXiv:1605.02701, 2016. Google Scholar
  11. Alexandr Andoni, Thijs Laarhoven, Ilya Razenshteyn, and Erik Waingarten. Optimal hashing-based time-space trade-offs for approximate near neighbors. In Proceedings of the twenty-eighth annual ACM-SIAM symposium on discrete algorithms, pages 47-66. SIAM, 2017. Google Scholar
  12. Alexandr Andoni, Thijs Laarhoven, Ilya P. Razenshteyn, and Erik Waingarten. Optimal hashing-based time-space trade-offs for approximate near neighbors. CoRR, 2016. URL: https://arxiv.org/abs/1608.03580.
  13. Alexandr Andoni and Ilya Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 793-801, 2015. Google Scholar
  14. Boris Aronov, Jean Cardinal, Justin Dallant, and John Iacono. A general technique for searching in implicit sets via function inversion. arXiv preprint arXiv:2311.12471, 2023. Google Scholar
  15. Boris Aronov, Esther Ezra, Micha Sharir, and Guy Zigdon. Time and space efficient collinearity indexing. Computational Geometry, 110:101963, 2023. Google Scholar
  16. Martin Aumüller, Erik Bernhardsson, and Alexander Faithfull. Ann-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms. Information Systems, 87:101374, 2020. Google Scholar
  17. Martin Aumuller, Sariel Har-Peled, Sepideh Mahabadi, Rasmus Pagh, and Francesco Silvestri. Fair near neighbor search via sampling. ACM SIGMOD Record, 50(1):42-49, 2021. Google Scholar
  18. Martin Aumüller, Rasmus Pagh, and Francesco Silvestri. Fair near neighbor search: Independent range sampling in high dimensions. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 191-204, 2020. Google Scholar
  19. Elad Barkan, Eli Biham, and Adi Shamir. Rigorous bounds on cryptanalytic time/memory tradeoffs. In Annual International Cryptology Conference, pages 1-21. Springer, 2006. Google Scholar
  20. Pavel Berkhin. A survey of clustering data mining techniques. In Grouping multidimensional data: Recent advances in clustering, pages 25-71. Springer, 2006. Google Scholar
  21. Philip Bille, Inge Li Gørtz, Moshe Lewenstein, Solon P Pissis, Eva Rotenberg, and Teresa Anna Steiner. Gapped string indexing in subquadratic space and sublinear query time. arXiv preprint arXiv:2211.16860, 2022. Google Scholar
  22. Andrei Z Broder. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pages 21-29. IEEE, 1997. Google Scholar
  23. Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, pages 493-507, 1952. Google Scholar
  24. Tobias Christiani. A framework for similarity search with space-time tradeoffs using locality-sensitive filtering. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 31-46. SIAM, 2017. Google Scholar
  25. Tobias Christiani and Rasmus Pagh. Set similarity search beyond minhash. In Proceedings of the 49th annual ACM SIGACT symposium on theory of computing, pages 1094-1107, 2017. Google Scholar
  26. Henry Corrigan-Gibbs and Dmitry Kogan. The function-inversion problem: Barriers and opportunities. In Theory of Cryptography Conference, pages 393-421. Springer, 2019. Google Scholar
  27. Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE transactions on information theory, 13(1):21-27, 1967. Google Scholar
  28. Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry, pages 253-262, 2004. Google Scholar
  29. Anne Driemel and Francesco Silvestri. Locality-sensitive hashing of curves. In 33rd International Symposium on Computational Geometry (SoCG 2017). Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017. Google Scholar
  30. Amos Fiat and Moni Naor. Rigorous time/space trade-offs for inverting functions. SIAM Journal on Computing, 29(3):790-803, 2000. Google Scholar
  31. Alexander Golovnev, Siyao Guo, Thibaut Horel, Sunoo Park, and Vinod Vaikuntanathan. Data structures meet cryptography: 3sum with preprocessing. In Proceedings of the 52nd annual ACM SIGACT symposium on theory of computing, pages 294-307, 2020. Google Scholar
  32. Alexander Golovnev, Siyao Guo, Spencer Peters, and Noah Stephens-Davidowitz. Revisiting time-space tradeoffs for function inversion. In Annual International Cryptology Conference, pages 453-481. Springer, 2023. Google Scholar
  33. David Gorisse, Matthieu Cord, and Frederic Precioso. Locality-sensitive hashing for chi2 distance. IEEE transactions on pattern analysis and machine intelligence, 34(2):402-409, 2011. Google Scholar
  34. Christian Hachenberg and Thomas Gottron. Locality sensitive hashing for scalable structural classification and clustering of web documents. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pages 359-368, 2013. Google Scholar
  35. Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory OF Computing, 8:321-350, 2012. Google Scholar
  36. Martin Hellman. A cryptanalytic time-memory trade-off. IEEE transactions on Information Theory, 26(4):401-406, 1980. Google Scholar
  37. Piotr Indyk. High-dimensional computational geometry. PhD thesis, Stanford University, 2001. Google Scholar
  38. Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604-613, 1998. Google Scholar
  39. Michael Kapralov. Smooth tradeoffs between insert and query complexity in nearest neighbor search. In Proceedings of the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 329-342, 2015. Google Scholar
  40. Kifayat Ullah Khan, Batjargal Dolgorsuren, Tu Nguyen Anh, Waqas Nawaz, and Young-Koo Lee. Faster compression methods for a weighted graph using locality sensitive hashing. Information Sciences, 421:237-253, 2017. Google Scholar
  41. Tsvi Kopelowitz and Ely Porat. The strong 3sum-indexing conjecture is false. arXiv preprint arXiv:1907.11206, 2019. Google Scholar
  42. Thijs Laarhoven. Tradeoffs for nearest neighbors on the sphere. arXiv preprint arXiv:1511.07527, 2015. Google Scholar
  43. Thijs Laarhoven. Graph-based time-space trade-offs for approximate near neighbors. In 34th International Symposium on Computational Geometry (SoCG 2018). Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018. Google Scholar
  44. Nathan Linial, Eran London, and Yuri Rabinovich. The geometry of graphs and some of its algorithmic applications. Combinatorica, 15:215-245, 1995. Google Scholar
  45. Guillaume Marçais, Dan DeBlasio, Prashant Pandey, and Carl Kingsford. Locality-sensitive hashing for the edit distance. Bioinformatics, 35(14):i127-i135, 2019. Google Scholar
  46. Samuel McCauley. Approximate similarity search under edit distance using locality-sensitive hashing. In 24th International Conference on Database Theory, 2021. Google Scholar
  47. Samuel McCauley, Jesper W Mikkelsen, and Rasmus Pagh. Set similarity search for skewed data. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 63-74, 2018. Google Scholar
  48. Marvin Minsky and Seymour A. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, 1969. Google Scholar
  49. Michael Mitzenmacher and Eli Upfal. Probability and computing: Randomization and probabilistic techniques in algorithms and data analysis. Cambridge university press, 2017. Google Scholar
  50. Rina Panigrahy. Entropy based nearest neighbor search in high dimensions. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 1186-1195, 2006. Google Scholar
  51. Aviad Rubinstein. Hardness of approximate nearest neighbor search. In Proceedings of the 50th annual ACM SIGACT symposium on theory of computing, pages 1260-1268, 2018. Google Scholar
  52. Gerard Salton, Anita Wong, and Chung-Shu Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613-620, 1975. Google Scholar
  53. Francisco Santoyo, Edgar Chavez, and Eric S Tellez. Compressing locality sensitive hashing tables. In 2013 Mexican International Conference on Computer Science, pages 41-46. IEEE, 2013. Google Scholar
  54. Ryan Williams. On the difference between closest, furthest, and orthogonal pairs: Nearly-linear vs barely-subquadratic complexity. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1207-1215. SIAM, 2018. Google Scholar
  55. Andrew Chi-Chih Yao. Coherent functions and program checkers. In Proceedings of the twenty-second annual ACM symposium on Theory of computing, pages 84-94, 1990. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail