Geometric data structures have been extensively studied in the regime where the dimension is much smaller than the number of input points. But in many machine-learning scenarios, the dimension can be much higher than the number of points, so high that the data structure might be unable to read and store all coordinates of the input and query points. Inspired by these scenarios and by related studies in feature selection and explainable clustering, we initiate the study of geometric data structures in this ultra-high-dimensional regime.

Our focus is the approximate nearest neighbor problem. In this problem, we are given a set of n points C ⊆ ℝ^d and have to produce a small data structure that can quickly answer the following query: given q ∈ ℝ^d, return a point c ∈ C that is approximately nearest to q, where distance is measured under the 𝓁₁, 𝓁₂, or other norms. Many groundbreaking (1+ε)-approximation algorithms have recently been discovered for 𝓁₁- and 𝓁₂-norm distances in the regime where d ≪ n. The main question in this paper is: Is there a data structure with sublinear (o(nd)) space and sublinear (o(d)) query time when d ≫ n?

This question can be partially answered from the machine-learning literature:
- For 𝓁₁-norm distances, an Õ(log(n))-approximation data structure with Õ(n log(d)) space and O(n) query time can be obtained from explainable clustering techniques [Dasgupta et al. ICML'20; Makarychev and Shan ICML'21; Esfandiari, Mirrokni, and Narayanan SODA'22; Gamlath et al. NeurIPS'21; Charikar and Hu SODA'22].
- For 𝓁₂-norm distances, a (√3+ε)-approximation data structure with Õ(n log(d)/poly(ε)) space and Õ(n/poly(ε)) query time can be obtained from feature selection techniques [Boutsidis, Drineas, and Mahoney NeurIPS'09; Boutsidis et al. IEEE Trans. Inf. Theory'15; Cohen et al. STOC'15].
- For 𝓁_p-norm distances, an O(n^{p-1} log²(n))-approximation data structure with O(n log(n) + n log(d)) space and O(n) query time can be obtained from the explainable clustering algorithms of [Gamlath et al. NeurIPS'21].

An important open problem is whether a (1+ε)-approximation data structure exists. This is not known for any norm, even with larger (e.g. poly(n)⋅o(d)) space and query time. In this paper, we answer this question affirmatively. We present (1+ε)-approximation data structures with the following guarantees.
- For 𝓁₁- and 𝓁₂-norm distances: Õ(n log(d)/poly(ε)) space and Õ(n/poly(ε)) query time. We show that these space and time bounds are tight up to poly(log(n)/ε) factors.
- For 𝓁_p-norm distances: Õ(n² log(d) (log log(n)/ε)^p) space and Õ(n (log log(n)/ε)^p) query time.

Via simple reductions, our data structures imply sublinear-in-d data structures for some other geometric problems, e.g. approximate orthogonal range search (in the style of [Arya and Mount SoCG'95]) and furthest neighbor, and they give rise to a sublinear O(1)-approximate representation of k-median and k-means clustering. We hope that this paper inspires future work on sublinear geometric data structures.
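To make the query semantics concrete (this is not the paper's sublinear data structure, only a minimal illustrative baseline), the exact nearest neighbor under an 𝓁_p norm can be found by a linear scan over all n points, reading every coordinate. This costs O(nd) space and O(nd) query time, exactly the regime the paper's sublinear structures improve on when d ≫ n. The function name and point format below are illustrative choices, not from the paper:

```python
def nearest_neighbor(C, q, p=2):
    """Exact nearest neighbor by brute-force linear scan.

    C: list of n points, each a list of d floats.
    q: query point in R^d.
    p: exponent of the l_p norm (p=1 gives l_1, p=2 gives l_2).

    Reads all n*d coordinates, so the query time is O(n*d); this is
    the naive baseline, with no approximation and no sublinearity.
    """
    def lp_dist(c):
        # l_p distance between c and q: (sum_i |c_i - q_i|^p)^(1/p).
        return sum(abs(ci - qi) ** p for ci, qi in zip(c, q)) ** (1.0 / p)
    return min(C, key=lp_dist)

# Example: three points in R^4; the query is closest to the all-ones point.
C = [[0.0, 0.0, 0.0, 0.0],
     [1.0, 1.0, 1.0, 1.0],
     [5.0, 5.0, 5.0, 5.0]]
q = [0.9, 1.1, 1.0, 1.0]
print(nearest_neighbor(C, q, p=2))  # -> [1.0, 1.0, 1.0, 1.0]
```

A (1+ε)-approximate answer only needs a point c with dist(q, c) ≤ (1+ε)·dist(q, c*) for the true nearest c*; the paper's contribution is achieving this while reading far fewer than all d coordinates of q.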
@InProceedings{herold_et_al:LIPIcs.SoCG.2025.56,
  author =    {Herold, Martin G. and Nanongkai, Danupon and Spoerhase, Joachim and Varma, Nithin and Wu, Zihang},
  title =     {{Sublinear Data Structures for Nearest Neighbor in Ultra High Dimensions}},
  booktitle = {41st International Symposium on Computational Geometry (SoCG 2025)},
  pages =     {56:1--56:15},
  series =    {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =      {978-3-95977-370-6},
  ISSN =      {1868-8969},
  year =      {2025},
  volume =    {332},
  editor =    {Aichholzer, Oswin and Wang, Haitao},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =   {Dagstuhl, Germany},
  URL =       {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SoCG.2025.56},
  URN =       {urn:nbn:de:0030-drops-232087},
  doi =       {10.4230/LIPIcs.SoCG.2025.56},
  annote =    {Keywords: sublinear data structure, approximate nearest neighbor}
}