Improved Space-Efficient Approximate Nearest Neighbor Search Using Function Inversion
Approximate nearest neighbor search (ANN) data structures have widespread applications in machine learning, computational biology, and text processing. The goal of ANN is to preprocess a set S so that, given a query q, we can find a point y whose distance from q approximates the smallest distance from q to any point in S. For most distance functions, the best-known ANN bounds for high-dimensional point sets are obtained using techniques based on locality-sensitive hashing (LSH).
Unfortunately, space efficiency is a major challenge for LSH-based data structures. Classic LSH techniques require a very large amount of space, oftentimes polynomial in |S|. A long line of work has developed intricate techniques to reduce this space usage, but these techniques suffer from downsides: they must be hand tailored to each specific LSH, are often complicated, and their space reduction comes at the cost of significantly increased query times.
In this paper we explore a new way to improve the space efficiency of LSH using function inversion techniques, originally developed in (Fiat and Naor 2000).
We begin by describing how function inversion can be used to improve LSH data structures. This gives a fairly simple, black box method to reduce LSH space usage.
Then, we give a data structure that leverages function inversion to improve the query time of the best known near-linear space data structure for approximate nearest neighbor search under Euclidean distance: the ALRW data structure of (Andoni, Laarhoven, Razenshteyn, and Waingarten 2017). ALRW was previously shown to be optimal among "list-of-points" data structures for both Euclidean and Manhattan ANN; thus, in addition to giving improved bounds, our results imply that list-of-points data structures are not optimal for Euclidean or Manhattan ANN .
similarity search
locality-sensitive hashing
randomized algorithms
data structures
space efficiency
function inversion
Theory of computation~Sorting and searching
Theory of computation~Data compression
88:1-88:19
Regular Paper
This work was funded in part by NSF CCF 2103813.
https://arxiv.org/abs/2407.02468
I would like to thank Rasmus Pagh and Martin Aumüller for helpful discussions, and the anonymous reviewers for their comments and suggestions.
Samuel
McCauley
Samuel McCauley
Williams College Computer Science, Williamstown, MA, USA
https://orcid.org/0000-0001-8196-9662
10.4230/LIPIcs.ESA.2024.88
Thomas D Ahle, Martin Aumüller, and Rasmus Pagh. Parameter-free locality sensitive hashing for spherical range reporting. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 239-256. SIAM, 2017.
Thomas Dybdahl Ahle. Optimal las vegas locality sensitive data structures. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 938-949. IEEE, 2017.
Thomas Dybdahl Ahle, Rasmus Pagh, Ilya Razenshteyn, and Francesco Silvestri. On the complexity of inner product similarity join. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 151-164, 2016.
Josh Alman and Ryan Williams. Probabilistic polynomials and hamming nearest neighbors. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 136-150. IEEE, 2015.
Alexandr Andoni. Nearest neighbor search: the old, the new, and the impossible. PhD thesis, Massachusetts Institute of Technology, 2009.
Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In 47th Annual Symposium on Foundations of Computer Science (FOCS), pages 459-468. IEEE, 2006.
Alexandr Andoni and Piotr Indyk. Nearest neighbors in high-dimensional spaces. In Handbook of Discrete and Computational Geometry, pages 1135-1155. Chapman and Hall/CRC, 2017.
Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, and Ludwig Schmidt. Practical and optimal lsh for angular distance. Advances in neural information processing systems, 28, 2015.
Alexandr Andoni, Piotr Indyk, Huy L Nguyen, and Ilya Razenshteyn. Beyond locality-sensitive hashing. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 1018-1028. SIAM, 2014.
Alexandr Andoni, Thijs Laarhoven, Ilya Razenshteyn, and Erik Waingarten. Lower bounds on time-space trade-offs for approximate near neighbors. arXiv preprint arXiv:1605.02701, 2016.
Alexandr Andoni, Thijs Laarhoven, Ilya Razenshteyn, and Erik Waingarten. Optimal hashing-based time-space trade-offs for approximate near neighbors. In Proceedings of the twenty-eighth annual ACM-SIAM symposium on discrete algorithms, pages 47-66. SIAM, 2017.
Alexandr Andoni, Thijs Laarhoven, Ilya P. Razenshteyn, and Erik Waingarten. Optimal hashing-based time-space trade-offs for approximate near neighbors. CoRR, 2016. URL: https://arxiv.org/abs/1608.03580.
https://arxiv.org/abs/1608.03580
Alexandr Andoni and Ilya Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 793-801, 2015.
Boris Aronov, Jean Cardinal, Justin Dallant, and John Iacono. A general technique for searching in implicit sets via function inversion. arXiv preprint arXiv:2311.12471, 2023.
Boris Aronov, Esther Ezra, Micha Sharir, and Guy Zigdon. Time and space efficient collinearity indexing. Computational Geometry, 110:101963, 2023.
Martin Aumüller, Erik Bernhardsson, and Alexander Faithfull. Ann-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms. Information Systems, 87:101374, 2020.
Martin Aumuller, Sariel Har-Peled, Sepideh Mahabadi, Rasmus Pagh, and Francesco Silvestri. Fair near neighbor search via sampling. ACM SIGMOD Record, 50(1):42-49, 2021.
Martin Aumüller, Rasmus Pagh, and Francesco Silvestri. Fair near neighbor search: Independent range sampling in high dimensions. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 191-204, 2020.
Elad Barkan, Eli Biham, and Adi Shamir. Rigorous bounds on cryptanalytic time/memory tradeoffs. In Annual International Cryptology Conference, pages 1-21. Springer, 2006.
Pavel Berkhin. A survey of clustering data mining techniques. In Grouping multidimensional data: Recent advances in clustering, pages 25-71. Springer, 2006.
Philip Bille, Inge Li Gørtz, Moshe Lewenstein, Solon P Pissis, Eva Rotenberg, and Teresa Anna Steiner. Gapped string indexing in subquadratic space and sublinear query time. arXiv preprint arXiv:2211.16860, 2022.
Andrei Z Broder. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pages 21-29. IEEE, 1997.
Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, pages 493-507, 1952.
Tobias Christiani. A framework for similarity search with space-time tradeoffs using locality-sensitive filtering. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 31-46. SIAM, 2017.
Tobias Christiani and Rasmus Pagh. Set similarity search beyond minhash. In Proceedings of the 49th annual ACM SIGACT symposium on theory of computing, pages 1094-1107, 2017.
Henry Corrigan-Gibbs and Dmitry Kogan. The function-inversion problem: Barriers and opportunities. In Theory of Cryptography Conference, pages 393-421. Springer, 2019.
Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE transactions on information theory, 13(1):21-27, 1967.
Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry, pages 253-262, 2004.
Anne Driemel and Francesco Silvestri. Locality-sensitive hashing of curves. In 33rd International Symposium on Computational Geometry (SoCG 2017). Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017.
Amos Fiat and Moni Naor. Rigorous time/space trade-offs for inverting functions. SIAM Journal on Computing, 29(3):790-803, 2000.
Alexander Golovnev, Siyao Guo, Thibaut Horel, Sunoo Park, and Vinod Vaikuntanathan. Data structures meet cryptography: 3sum with preprocessing. In Proceedings of the 52nd annual ACM SIGACT symposium on theory of computing, pages 294-307, 2020.
Alexander Golovnev, Siyao Guo, Spencer Peters, and Noah Stephens-Davidowitz. Revisiting time-space tradeoffs for function inversion. In Annual International Cryptology Conference, pages 453-481. Springer, 2023.
David Gorisse, Matthieu Cord, and Frederic Precioso. Locality-sensitive hashing for chi2 distance. IEEE transactions on pattern analysis and machine intelligence, 34(2):402-409, 2011.
Christian Hachenberg and Thomas Gottron. Locality sensitive hashing for scalable structural classification and clustering of web documents. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pages 359-368, 2013.
Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory OF Computing, 8:321-350, 2012.
Martin Hellman. A cryptanalytic time-memory trade-off. IEEE transactions on Information Theory, 26(4):401-406, 1980.
Piotr Indyk. High-dimensional computational geometry. PhD thesis, Stanford University, 2001.
Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604-613, 1998.
Michael Kapralov. Smooth tradeoffs between insert and query complexity in nearest neighbor search. In Proceedings of the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 329-342, 2015.
Kifayat Ullah Khan, Batjargal Dolgorsuren, Tu Nguyen Anh, Waqas Nawaz, and Young-Koo Lee. Faster compression methods for a weighted graph using locality sensitive hashing. Information Sciences, 421:237-253, 2017.
Tsvi Kopelowitz and Ely Porat. The strong 3sum-indexing conjecture is false. arXiv preprint arXiv:1907.11206, 2019.
Thijs Laarhoven. Tradeoffs for nearest neighbors on the sphere. arXiv preprint arXiv:1511.07527, 2015.
Thijs Laarhoven. Graph-based time-space trade-offs for approximate near neighbors. In 34th International Symposium on Computational Geometry (SoCG 2018). Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018.
Nathan Linial, Eran London, and Yuri Rabinovich. The geometry of graphs and some of its algorithmic applications. Combinatorica, 15:215-245, 1995.
Guillaume Marçais, Dan DeBlasio, Prashant Pandey, and Carl Kingsford. Locality-sensitive hashing for the edit distance. Bioinformatics, 35(14):i127-i135, 2019.
Samuel McCauley. Approximate similarity search under edit distance using locality-sensitive hashing. In 24th International Conference on Database Theory, 2021.
Samuel McCauley, Jesper W Mikkelsen, and Rasmus Pagh. Set similarity search for skewed data. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 63-74, 2018.
Marvin Minsky and Seymour A. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, 1969.
Michael Mitzenmacher and Eli Upfal. Probability and computing: Randomization and probabilistic techniques in algorithms and data analysis. Cambridge university press, 2017.
Rina Panigrahy. Entropy based nearest neighbor search in high dimensions. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 1186-1195, 2006.
Aviad Rubinstein. Hardness of approximate nearest neighbor search. In Proceedings of the 50th annual ACM SIGACT symposium on theory of computing, pages 1260-1268, 2018.
Gerard Salton, Anita Wong, and Chung-Shu Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613-620, 1975.
Francisco Santoyo, Edgar Chavez, and Eric S Tellez. Compressing locality sensitive hashing tables. In 2013 Mexican International Conference on Computer Science, pages 41-46. IEEE, 2013.
Ryan Williams. On the difference between closest, furthest, and orthogonal pairs: Nearly-linear vs barely-subquadratic complexity. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1207-1215. SIAM, 2018.
Andrew Chi-Chih Yao. Coherent functions and program checkers. In Proceedings of the twenty-second annual ACM symposium on Theory of computing, pages 84-94, 1990.
Samuel McCauley
Creative Commons Attribution 4.0 International license
https://creativecommons.org/licenses/by/4.0/legalcode