PUFFINN: Parameterless and Universally Fast FInding of Nearest Neighbors

Authors Martin Aumüller, Tobias Christiani, Rasmus Pagh , Michael Vesterli



PDF
Thumbnail PDF

File

LIPIcs.ESA.2019.10.pdf
  • Filesize: 0.77 MB
  • 16 pages

Document Identifiers

Author Details

Martin Aumüller
  • IT University of Copenhagen, Denmark
Tobias Christiani
  • IT University of Copenhagen, Denmark
Rasmus Pagh
  • BARC and IT University of Copenhagen, Denmark
Michael Vesterli
  • IT University of Copenhagen, Denmark

Cite AsGet BibTex

Martin Aumüller, Tobias Christiani, Rasmus Pagh, and Michael Vesterli. PUFFINN: Parameterless and Universally Fast FInding of Nearest Neighbors. In 27th Annual European Symposium on Algorithms (ESA 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 144, pp. 10:1-10:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/LIPIcs.ESA.2019.10

Abstract

We present PUFFINN, a parameterless LSH-based index for solving the k-nearest neighbor problem with probabilistic guarantees. By parameterless we mean that the user is only required to specify the amount of memory the index is supposed to use and the result quality that should be achieved. The index combines several heuristic ideas known in the literature. By small adaptions to the query algorithm, we make heuristics rigorous. We perform experiments on real-world and synthetic inputs to evaluate implementation choices and show that the implementation satisfies the quality guarantees while being competitive with other state-of-the-art approaches to nearest neighbor search. We describe a novel synthetic data set that is difficult to solve for almost all existing nearest neighbor search approaches, and for which PUFFINN significantly outperform previous methods.

Subject Classification

ACM Subject Classification
  • Mathematics of computing → Probabilistic algorithms
  • Theory of computation → Design and analysis of algorithms
  • Theory of computation → Nearest neighbor algorithms
Keywords
  • Nearest Neighbor Search
  • Locality-Sensitive Hashing
  • Adaptive Similarity Search

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt. Practical and optimal LSH for angular distance. In Proc. NIPS '15, pages 1225-1233, 2015. Google Scholar
  2. Alexandr Andoni and Piotr Indyk. E2LSH, user manual, 2005. Google Scholar
  3. Alexandr Andoni and Piotr Indyk. Efficient algorithms for substring near neighbor problem. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 1203-1212. Society for Industrial and Applied Mathematics, 2006. Google Scholar
  4. Martin Aumüller, Erik Bernhardsson, and Alexander Faithfull. ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms. In SISAP'17, pages 34-49, 2017. URL: https://doi.org/10.1007/978-3-319-68474-1_3.
  5. Mayank Bawa, Tyson Condie, and Prasanna Ganesan. LSH forest: self-tuning indexes for similarity search. In Proceedings of 14th International Conference on World Wide Web (WWW), pages 651-660. ACM, 2005. Google Scholar
  6. Jon Louis Bentley. Multidimensional Binary Search Trees Used for Associative Searching. Commun. ACM, 18(9):509-517, 1975. Google Scholar
  7. Erik Bernhardsson. Annoy. URL: https://github.com/spotify/annoy.
  8. Leonid Boytsov and Bilegsaikhan Naidan. Engineering Efficient and Effective Non-metric Space Library. In SISAP'13, pages 280-293, 2013. URL: https://doi.org/10.1007/978-3-642-41062-8_28.
  9. Andrei Z. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences 1997. Proceedings, pages 21-29. IEEE, 1997. Google Scholar
  10. Moses Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of 34th ACM Symposium on Theory of Computing (STOC), pages 380-388, 2002. URL: https://doi.org/10.1145/509907.509965.
  11. Tobias Christiani. A Framework for Similarity Search with Space-Time Tradeoffs using Locality-Sensitive Filtering. In Proc. 28th Symposium on Discrete Algorithms (SODA), pages 31-46, 2017. Google Scholar
  12. Tobias Christiani. Fast Locality-Sensitive Hashing for Approximate Near Neighbor Search. CoRR, abs/1708.07586, 2017. URL: http://arxiv.org/abs/1708.07586.
  13. Tobias Christiani, Rasmus Pagh, and Mikkel Thorup. Confirmation Sampling for Exact Nearest Neighbor Search. CoRR, abs/1812.02603, 2018. URL: http://arxiv.org/abs/1812.02603.
  14. Paolo Ciaccia, Marco Patella, and Pavel Zezula. M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. In VLDB'97, pages 426-435, 1997. URL: http://www.vldb.org/conf/1997/P426.PDF.
  15. Søren Dahlgaard, Mathias Bæk Tejs Knudsen, and Mikkel Thorup. Fast Similarity Sketching. 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 663-671, 2017. Google Scholar
  16. Sanjoy Dasgupta and Yoav Freund. Random projection trees and low dimensional manifolds. In STOC'08, pages 537-546. ACM, 2008. Google Scholar
  17. Wei Dong. LSHkit. URL: http://lshkit.sourceforge.net/.
  18. Wei Dong, Zhe Wang, William Josephson, Moses Charikar, and Kai Li. Modeling LSH for performance tuning. In Proceedings of 17th ACM Conference on Information and Knowledge Management (CIKM), pages 669-678. ACM, 2008. Google Scholar
  19. Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality. Theory of Computing, 8(1):321-350, 2012. URL: https://doi.org/10.4086/toc.2012.v008a014.
  20. Piotr Indyk and Rajeev Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of 30th Annual ACM Symposium on the Theory of Computing (STOC), pages 604-613, 1998. URL: https://doi.org/10.1145/276698.276876.
  21. M. Iwasaki and D. Miyazaki. Optimization of Indexing Based on k-Nearest Neighbor Graph for Proximity Search in High-dimensional Data. ArXiv e-prints, October 2018. URL: http://arxiv.org/abs/1810.07355.
  22. Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. CoRR, abs/1702.08734, 2017. URL: http://arxiv.org/abs/1702.08734.
  23. Ping Li and Arnd Christian König. Theory and applications of b-bit minwise hashing. Commun. ACM, 54(8):101-109, 2011. URL: https://doi.org/10.1145/1978542.1978566.
  24. Y. A. Malkov and D. A. Yashunin. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. ArXiv e-prints, March 2016. URL: http://arxiv.org/abs/1603.09320.
  25. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. CoRR, abs/1301.3781, 2013. URL: http://arxiv.org/abs/1301.3781.
  26. Marius Muja and David G. Lowe. Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration. In VISSAPP'09, pages 331-340. INSTICC Press, 2009. Google Scholar
  27. Rasmus Pagh. Similarity Sketching. In Encyclopedia of Big Data Technologies. Springer, 2019. URL: https://doi.org/10.1007/978-3-319-63962-8_58-1.
  28. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543, 2014. Google Scholar
  29. Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177-1184, 2008. Google Scholar
  30. Venu Satuluri and Srinivasan Parthasarathy. Bayesian locality sensitive hashing for fast similarity search. Proceedings of the VLDB Endowment, 5(5):430-441, 2012. Google Scholar
  31. Kengo Terasawa and Yuzuru Tanaka. Spherical LSH for Approximate Nearest Neighbor Search on Unit Hypersphere. In Algorithms and Data Structures, 10th International Workshop, WADS 2007, Halifax, Canada, August 15-17, 2007, Proceedings, pages 27-38, 2007. URL: https://doi.org/10.1007/978-3-540-73951-7_4.
  32. Peter N. Yianilos. Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces. In Proceedings of the Fourth Annual ACM/SIGACT-SIAM Symposium on Discrete Algorithms, 25-27 January 1993, Austin, Texas, USA., pages 311-321, 1993. URL: http://dl.acm.org/citation.cfm?id=313559.313789.
  33. Pavel Zezula, Pasquale Savino, Giuseppe Amato, and Fausto Rabitti. Approximate Similarity Retrieval with M-Trees. VLDB J., 7(4):275-293, 1998. URL: https://doi.org/10.1007/s007780050069.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail