Geometric data structures have been extensively studied in the regime where the dimension is much smaller than the number of input points. But in many machine-learning scenarios, the dimension can be much higher than the number of points, so high that the data structure might be unable to read and store all coordinates of the input and query points. Inspired by these scenarios and by related studies in feature selection and explainable clustering, we initiate the study of geometric data structures in this ultra-high-dimensional regime.

Our focus is the approximate nearest neighbor problem. In this problem, we are given a set of n points C ⊆ ℝ^d and have to produce a small data structure that can quickly answer the following query: given q ∈ ℝ^d, return a point c ∈ C that is approximately nearest to q, where distance is measured under the 𝓁₁, 𝓁₂, or other norms. Many groundbreaking (1+ε)-approximation algorithms have recently been discovered for 𝓁₁- and 𝓁₂-norm distances in the regime where d ≪ n. The main question in this paper is: Is there a data structure with sublinear (o(nd)) space and sublinear (o(d)) query time when d ≫ n?

This question can be partially answered from the machine-learning literature:
- For 𝓁₁-norm distances, an Õ(log(n))-approximation data structure with Õ(n log(d)) space and O(n) query time can be obtained from explainable clustering techniques [Dasgupta et al. ICML'20; Makarychev and Shan ICML'21; Esfandiari, Mirrokni, and Narayanan SODA'22; Gamlath et al. NeurIPS'21; Charikar and Hu SODA'22].
- For 𝓁₂-norm distances, a (√3+ε)-approximation data structure with Õ(n log(d)/poly(ε)) space and Õ(n/poly(ε)) query time can be obtained from feature selection techniques [Boutsidis, Drineas, and Mahoney NeurIPS'09; Boutsidis et al. IEEE Trans. Inf. Theory'15; Cohen et al. STOC'15].
- For 𝓁_p-norm distances, an O(n^{p-1} log²(n))-approximation data structure with O(n log(n) + n log(d)) space and O(n) query time can be obtained from the explainable clustering algorithms of [Gamlath et al. NeurIPS'21].

An important open problem is whether a (1+ε)-approximation data structure exists. This is not known for any norm, even with larger (e.g. poly(n)⋅o(d)) space and query time. In this paper, we answer this question affirmatively. We present (1+ε)-approximation data structures with the following guarantees.
- For 𝓁₁- and 𝓁₂-norm distances: Õ(n log(d)/poly(ε)) space and Õ(n/poly(ε)) query time. We show that these space and time bounds are tight up to poly(log(n)/ε) factors.
- For 𝓁_p-norm distances: Õ(n² log(d) (log log(n)/ε)^p) space and Õ(n (log log(n)/ε)^p) query time.

Via simple reductions, our data structures imply sublinear-in-d data structures for some other geometric problems, e.g. approximate orthogonal range search (in the style of [Arya and Mount SoCG'95]) and furthest neighbor, and they give rise to a sublinear O(1)-approximate representation of k-median and k-means clustering. We hope that this paper inspires future work on sublinear geometric data structures.
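To make the query semantics concrete (this is not the paper's sublinear data structure, only a minimal illustrative baseline), the exact nearest neighbor under an 𝓁_p norm can be found by a linear scan over all n points, reading every coordinate. This costs O(nd) space and O(nd) query time, exactly the regime the paper's sublinear structures improve on when d ≫ n. The function name and point format below are illustrative choices, not from the paper:

```python
def nearest_neighbor(C, q, p=2):
    """Exact nearest neighbor by brute-force linear scan.

    C: list of n points, each a list of d floats.
    q: query point in R^d.
    p: exponent of the l_p norm (p=1 gives l_1, p=2 gives l_2).

    Reads all n*d coordinates, so the query time is O(n*d); this is
    the naive baseline, with no approximation and no sublinearity.
    """
    def lp_dist(c):
        # l_p distance between c and q: (sum_i |c_i - q_i|^p)^(1/p).
        return sum(abs(ci - qi) ** p for ci, qi in zip(c, q)) ** (1.0 / p)
    return min(C, key=lp_dist)

# Example: three points in R^4; the query is closest to the all-ones point.
C = [[0.0, 0.0, 0.0, 0.0],
     [1.0, 1.0, 1.0, 1.0],
     [5.0, 5.0, 5.0, 5.0]]
q = [0.9, 1.1, 1.0, 1.0]
print(nearest_neighbor(C, q, p=2))  # -> [1.0, 1.0, 1.0, 1.0]
```

A (1+ε)-approximate answer only needs a point c with dist(q, c) ≤ (1+ε)·dist(q, c*) for the true nearest c*; the paper's contribution is achieving this while reading far fewer than all d coordinates of q.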
@InProceedings{herold_et_al:LIPIcs.SoCG.2025.56,
  author =    {Herold, Martin G. and Nanongkai, Danupon and Spoerhase, Joachim and Varma, Nithin and Wu, Zihang},
  title =     {{Sublinear Data Structures for Nearest Neighbor in Ultra High Dimensions}},
  booktitle = {41st International Symposium on Computational Geometry (SoCG 2025)},
  pages =     {56:1--56:15},
  series =    {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =      {978-3-95977-370-6},
  ISSN =      {1868-8969},
  year =      {2025},
  volume =    {332},
  editor =    {Aichholzer, Oswin and Wang, Haitao},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =   {Dagstuhl, Germany},
  URL =       {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SoCG.2025.56},
  URN =       {urn:nbn:de:0030-drops-232087},
  doi =       {10.4230/LIPIcs.SoCG.2025.56},
  annote =    {Keywords: sublinear data structure, approximate nearest neighbor}
}