Algorithm Engineering for High-Dimensional Similarity Search Problems (Invited Talk)

Aumüller, Martin

doi:10.4230/LIPIcs.SEA.2020.1

File

Subject Classification

ACM Subject Classification

Theory of computation → Nearest neighbor algorithms
Computing methodologies → Simulation evaluation
Theory of computation → Computational geometry

Keywords

Nearest neighbor search
Benchmarking

Metrics

Access Statistics
Total Accesses (updated on a weekly basis)

0

PDF Downloads

0

Metadata Views

Abstract

Similarity search problems in high-dimensional data arise in many areas of computer science such as data bases, image analysis, machine learning, and natural language processing. One of the most prominent problems is finding the k nearest neighbors of a data point q ∈ ℝ^d in a large set of data points S ⊂ ℝ^d, under same distance measure such as Euclidean distance. In contrast to lower dimensional settings, we do not know of worst-case efficient data structures for such search problems in high-dimensional data, i.e., data structures that are faster than a linear scan through the data set. However, there is a rich body of (often heuristic) approaches that solve nearest neighbor search problems much faster than such a scan on many real-world data sets. As a necessity, the term solve means that these approaches give approximate results that are close to the true k-nearest neighbors. In this talk, we survey recent approaches to nearest neighbor search and related problems. The talk consists of three parts: (1) What makes nearest neighbor search difficult? (2) How do current state-of-the-art algorithms work? (3) What are recent advances regarding similarity search on GPUs, in distributed settings, or in external memory?

Cite As Get BibTex

Martin Aumüller. Algorithm Engineering for High-Dimensional Similarity Search Problems (Invited Talk). In 18th International Symposium on Experimental Algorithms (SEA 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 160, pp. 1:1-1:3, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020) https://doi.org/10.4230/LIPIcs.SEA.2020.1

Author Details

Martin Aumüller

IT University of Copenhagen, Denmark

References

Thomas D. Ahle, Martin Aumüller, and Rasmus Pagh. Parameter-free locality sensitive hashing for spherical range reporting. In SODA, pages 239-256. SIAM, 2017.
Martin Aumüller, Erik Bernhardsson, and Alexander John Faithfull. ANN-Benchmarks: A benchmarking tool for approximate nearest neighbor algorithms. Inf. Syst., 87, 2020. See https://arxiv.org/abs/1807.05614 for an open access version.
Martin Aumüller and Matteo Ceccarello. The role of local intrinsic dimensionality in benchmarking nearest neighbor search. In SISAP, volume 11807 of Lecture Notes in Computer Science, pages 113-127. Springer, 2019.
Martin Aumüller, Tobias Christiani, Rasmus Pagh, and Michael Vesterli. PUFFINN: parameterless and universally fast finding of nearest neighbors. In ESA, volume 144 of LIPIcs, pages 10:1-10:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019. URL: https://github.com/puffinn/puffinn.
Kevin S. Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is "nearest neighbor" meaningful? In ICDT, volume 1540 of Lecture Notes in Computer Science, pages 217-235. Springer, 1999.
Tobias Christiani. Fast locality-sensitive hashing frameworks for approximate near neighbor search. In SISAP, volume 11807 of Lecture Notes in Computer Science, pages 3-17. Springer, 2019.
Yihe Dong, Piotr Indyk, Ilya Razenshteyn, and Tal Wagner. Learning space partitions for nearest neighbor search. In International Conference on Learning Representations, 2020.
Fabian Fier, Nikolaus Augsten, Panagiotis Bouros, Ulf Leser, and Johann-Christoph Freytag. Set similarity joins on MapReduce: An experimental survey. PVLDB, 11(10):1110-1122, 2018.
Fabian Groh, Lukas Ruppert, Patrick Wieschollek, and Hendrik Lensch. Ggnn: Graph-based gpu nearest neighbor search. arXiv preprint arXiv:1912.01059, 2019.
Junfeng He, Sanjiv Kumar, and Shih-Fu Chang. On the difficulty of nearest neighbor search. In ICML. icml.cc / Omnipress, 2012.
Xiao Hu, Ke Yi, and Yufei Tao. Output-optimal massively parallel algorithms for similarity joins. ACM Transactions on Database Systems, 44(2):1-36, April 2019. URL: https://doi.org/10.1145/3311967.
Piotr Indyk and Rajeev Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC '98, pages 604-613, New York, NY, USA, 1998. ACM. URL: https://doi.org/10.1145/276698.276876.
M. Iwasaki and D. Miyazaki. Optimization of Indexing Based on k-Nearest Neighbor Graph for Proximity Search in High-dimensional Data. ArXiv e-prints, October 2018. URL: http://arxiv.org/abs/1810.07355.
Tim Kraska, Alex Beutel, Ed H Chi, Jeffrey Dean, and Neoklis Polyzotis. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data, pages 489-504, 2018.
Thijs Laarhoven. Graph-based time-space trade-offs for approximate near neighbors. In Symposium on Computational Geometry, volume 99 of LIPIcs, pages 57:1-57:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018.
Yury A. Malkov and D. A. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans. Pattern Anal. Mach. Intell., 42(4):824-836, 2020.
Willi Mann, Nikolaus Augsten, and Panagiotis Bouros. An empirical evaluation of set similarity join techniques. Proceedings of the VLDB Endowment, 9(9):636-647, May 2016. URL: https://doi.org/10.14778/2947618.2947620.
Samuel McCauley and Francesco Silvestri. Adaptive MapReduce similarity joins. In BeyondMR@SIGMOD, pages 4:1-4:4. ACM, 2018.
Milos Radovanovic. Hubs in nearest-neighbor graphs: Origins, applications and challenges. In WIMS, pages 5:1-5:4. ACM, 2018.
Suhas Jayaram Subramanya, Devvrit, Rohan Kadekodi, Ravishankar Krishnaswamy, and Harsha Simhadri. DiskANN: Fast accurate billion-point nearest neighbor search on a single node. In NeurIPS 2019, November 2019. URL: https://www.microsoft.com/en-us/research/publication/diskann-fast-accurate-billion-point-nearest-neighbor-search-on-a-single-node/.

Algorithm Engineering for High-Dimensional Similarity Search Problems (Invited Talk)

Author Martin Aumüller

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

References

Thanks for your feedback!

Could not send message