Approximate Similarity Search Under Edit Distance Using Locality-Sensitive Hashing

McCauley, Samuel

doi:10.4230/LIPIcs.ICDT.2021.21

Cite As

Samuel McCauley. Approximate Similarity Search Under Edit Distance Using Locality-Sensitive Hashing. In 24th International Conference on Database Theory (ICDT 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 186, pp. 21:1-21:22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.ICDT.2021.21

Abstract

Edit distance similarity search, also called approximate pattern matching, is a fundamental problem with widespread database applications. The goal is to preprocess n strings of length d so that queries q of the following form can be answered quickly: if there is a database string within edit distance r of q, return a database string within edit distance cr of q. Previous approaches to this problem rely either on very large (superconstant) approximation ratios c or on very small search radii r. Outside of a narrow parameter range, these solutions are not competitive with trivially searching through all n strings. In this work we give a simple and easy-to-implement hash function that can quickly answer queries for a wide range of parameters. Specifically, our strategy can answer queries in time Õ(d · 3^r · n^{1/c}). The best known practical results require c ≫ r to achieve any correctness guarantee; meanwhile, the best known theoretical results are very involved and difficult to implement, and require query time that can be loosely bounded below by 24^r. Our results significantly broaden the range of parameters for which there exist nontrivial theoretical bounds, while retaining the practicality of a locality-sensitive hash function.
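
To make the query guarantee concrete, the following is a minimal sketch of the trivial baseline the abstract mentions: scanning all n strings and returning the first one within edit distance cr of q. It is not the paper's hash function. The names edit_distance and linear_scan_query are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def linear_scan_query(database: Iterable[str], q: str,
                      r: int, c: int) -> Optional[str]:
    """Trivial baseline (hypothetical helper, not the paper's method).

    Returns the first database string within edit distance c*r of q.
    If some database string lies within distance r of q, that string
    also lies within c*r, so the scan cannot return None in that case.
    """
    limit = c * r
    for s in database:
        if edit_distance(s, q) <= limit:
            return s
    return None
```

For example, linear_scan_query(["kitten", "mitten", "banana"], "sitting", r=1, c=3) returns "kitten", which is at edit distance 3 from the query. The scan performs one edit distance computation per database string, which is the cost that a hashing-based approach aims to beat.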

Subject Classification

ACM Subject Classification

Information systems → Nearest-neighbor search
Theory of computation → Pattern matching

Keywords

edit distance
approximate pattern matching
approximate nearest neighbor
similarity search
locality-sensitive hashing

