Approximate Similarity Search Under Edit Distance Using Locality-Sensitive Hashing

Author Samuel McCauley



PDF
Thumbnail PDF

File

LIPIcs.ICDT.2021.21.pdf
  • Filesize: 0.76 MB
  • 22 pages

Document Identifiers

Author Details

Samuel McCauley
  • Williams College, Williamstown, MA, USA

Acknowledgements

I would like to thank Thomas Ahle, Martin Aumüller, Tobias Christiani, Tsvi Kopelowitz, Rasmus Pagh, Ely Porat, Francesco Silvestri, and Shikha Singh for many helpful comments and discussions.

Cite AsGet BibTex

Samuel McCauley. Approximate Similarity Search Under Edit Distance Using Locality-Sensitive Hashing. In 24th International Conference on Database Theory (ICDT 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 186, pp. 21:1-21:22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.ICDT.2021.21

Abstract

Edit distance similarity search, also called approximate pattern matching, is a fundamental problem with widespread database applications. The goal of the problem is to preprocess n strings of length d, to quickly answer queries q of the form: if there is a database string within edit distance r of q, return a database string within edit distance cr of q. Previous approaches to this problem either rely on very large (superconstant) approximation ratios c, or very small search radii r. Outside of a narrow parameter range, these solutions are not competitive with trivially searching through all n strings. In this work we give a simple and easy-to-implement hash function that can quickly answer queries for a wide range of parameters. Specifically, our strategy can answer queries in time Õ(d3^rn^{1/c}). The best known practical results require c ≫ r to achieve any correctness guarantee; meanwhile, the best known theoretical results are very involved and difficult to implement, and require query time that can be loosely bounded below by 24^r. Our results significantly broaden the range of parameters for which there exist nontrivial theoretical bounds, while retaining the practicality of a locality-sensitive hash function.

Subject Classification

ACM Subject Classification
  • Information systems → Nearest-neighbor search
  • Theory of computation → Pattern matching
Keywords
  • edit distance
  • approximate pattern matching
  • approximate nearest neighbor
  • similarity search
  • locality-sensitive hashing

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Thomas Dybdahl Ahle, Martin Aumüller, and Rasmus Pagh. Parameter-free locality sensitive hashing for spherical range reporting. In Proc. 28th Symposium on Discrete Algorithms (SODA), pages 239-256. SIAM, 2017. Google Scholar
  2. Josh Alman and Ryan Williams. Probabilistic polynomials and Hamming nearest neighbors. In Proc. 56th Symposium on Foundations of Computer Science (FOCS), pages 136-150. IEEE, 2015. Google Scholar
  3. Alexandr Andoni, Piotr Indyk, and Ilya Razenshteyn. Approximate nearest neighbor search in high dimensions. In Proc. International Congress of Mathematicians (ICM), pages 3271-3302, 2018. Google Scholar
  4. Alexandr Andoni, Thijs Laarhoven, Ilya Razenshteyn, and Erik Waingarten. Optimal hashing-based time-space trade-offs for approximate near neighbors. In Proc. 28th Symposium on Discrete Algorithms (SODA), pages 47-66. ACM-SIAM, 2017. Google Scholar
  5. Martin Aumüller, Erik Bernhardsson, and Alexander Faithfull. ANN-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms. Information Systems, 2019. Google Scholar
  6. Leonid Boytsov. Indexing methods for approximate dictionary searching: Comparative analysis. Journal of Experimental Algorithmics (JEA), 16:1-1, 2011. Google Scholar
  7. Eric Brill and Robert C Moore. An improved error model for noisy channel spelling correction. In Proc. 38th Meeting on Association for Computational Linguistics, pages 286-293. Association for Computational Linguistics, 2000. Google Scholar
  8. Diptarka Chakraborty, Elazar Goldenberg, and Michal Koucky. Streaming algorithms for embedding and computing edit distance in the low distance regime. In Proc. 48th Annual Symposium on Theory of Computing (STOC), pages 712-725. ACM, 2016. Google Scholar
  9. Ho-Leung Chan, Tak-Wah Lam, Wing-Kin Sung, Siu-Lung Tam, and Swee-Seong Wong. A linear size index for approximate pattern matching. In Proc. 17th Symposium on Combinatorial Pattern Matching (CPM), pages 49-59. Springer, 2006. Google Scholar
  10. Moses S Charikar. Similarity estimation techniques from rounding algorithms. In Proc. 34th Symposium on Theory of Computing (STOC), pages 380-388. ACM, 2002. Google Scholar
  11. Flavio Chierichetti, Ravi Kumar, and Mohammad Mahdian. The complexity of LSH feasibility. Theoretical Computer Science, 530:89-101, 2014. Google Scholar
  12. Tobias Christiani. A framework for similarity search with space-time tradeoffs using locality-sensitive filtering. In Proc. 28th Symposium on Discrete Algorithms (SODA), pages 31-46. Society for Industrial and Applied Mathematics, 2017. Google Scholar
  13. Tobias Christiani and Rasmus Pagh. Set similarity search beyond minhash. In Proc. 49th Symposium on Theory of Computing (STOC), pages 1094-1107. ACM, 2017. Google Scholar
  14. Vincent Cohen-Addad, Laurent Feuilloley, and Tatiana Starikovskaya. Lower bounds for text indexing with mismatches and differences. In Proc. 30th Symposium on Discrete Algorithms (SODA), pages 1146-1164. ACM-SIAM, 2019. Google Scholar
  15. Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don't cares. In Proc. 36th ACM Symposium on Theory of Computing (STOC), pages 91-100. ACM, 2004. Google Scholar
  16. Benjamin Coleman, Richard Baraniuk, and Anshumali Shrivastava. Sub-linear memory sketches for near neighbor search on streaming data. In Proc. 27th International Conference on Machine Learning (ICML), pages 2089-2099. PMLR, 2020. Google Scholar
  17. Wei Dong, Charikar Moses, and Kai Li. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proc. 20th Conference on the World Wide Web (WWW), pages 577-586. ACM, 2011. Google Scholar
  18. Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of computing, 8(1):321-350, 2012. Google Scholar
  19. Piotr Indyk. Approximate nearest neighbor under edit distance via product metrics. In Proc. 15th Symposium on Discrete Algorithms (SODA), pages 646-650. ACM-SIAM, 2004. Google Scholar
  20. Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proc. 30th Symposium on Theory of Computing (STOC), pages 604-613. ACM, 1998. Google Scholar
  21. Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus. IEEE Transactions on Big Data, 2019. Google Scholar
  22. Tamer Kahveci, Vebjorn Ljosa, and Ambuj K Singh. Speeding up whole-genome alignment by indexing frequency vectors. Bioinformatics, 20(13):2122-2134, 2004. Google Scholar
  23. Subhash Khot and Assaf Naor. Nonembeddability theorems via fourier analysis. Mathematische Annalen, 334(4):821-852, 2006. Google Scholar
  24. Tak Wah Lam, Wing-Kin Sung, Siu-Lung Tam, Chi-Kwong Wong, and Siu-Ming Yiu. Compressed indexing and local alignment of dna. Bioinformatics, 24(6):791-797, 2008. Google Scholar
  25. Wen Li, Ying Zhang, Yifang Sun, Wei Wang, Wenjie Zhang, and Xuemin Lin. Approximate nearest neighbor search on high dimensional data - experiments, analyses, and improvement. Transactions on Knowledge and Data Engineering (TKDE), 32(8):1475-1488, 2019. Google Scholar
  26. Moritz G Maaß and Johannes Nowak. Text indexing with errors. In Proc. 16th Symposium on Combinatorial Pattern Matching (CPM), pages 21-32. Springer, 2005. Google Scholar
  27. Yury A Malkov and Dmitry A Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE transactions on pattern analysis and machine intelligence, 42(4):824-836, 2020. Google Scholar
  28. Udi Manber and Sun Wu. An algorithm for approximate membership checking with application to password security. Information Processing Letters, 50(4):191-197, 1994. Google Scholar
  29. Guillaume Marçais, Dan DeBlasio, Prashant Pandey, and Carl Kingsford. Locality sensitive hashing for the edit distance. Bioinformatics, 35(14):i127-i135, 2019. Google Scholar
  30. Michael Mitzenmacher and Eli Upfal. Probability and computing: Randomization and probabilistic techniques in algorithms and data analysis. Cambridge university press, 2017. Google Scholar
  31. Gonzalo Navarro. A guided tour to approximate string matching. ACM computing surveys (CSUR), 33(1):31-88, 2001. Google Scholar
  32. Rafail Ostrovsky and Yuval Rabani. Low distortion embeddings for edit distance. Journal of the ACM (JACM), 54(5):23, 2007. Google Scholar
  33. Ozgur Ozturk and Hakan Ferhatosmanoglu. Effective indexing and filtering for similarity search in large biosequence databases. In Proc. 3rd Symposium on Bioinformatics and Bioengineering, pages 359-366. IEEE, 2003. Google Scholar
  34. Rasmus Pagh, Ninh Pham, Francesco Silvestri, and Morten Stöckel. I/O-efficient similarity join. Algorithmica, 78(4):1263-1283, 2017. Google Scholar
  35. Aviad Rubinstein. Hardness of approximate nearest neighbor search. In Proc. 50th Symposium on Theory of Computing (STOC), pages 1260-1268. ACM, 2018. Google Scholar
  36. Esko Ukkonen. Algorithms for approximate string matching. Information and control, 64(1-3):100-118, 1985. Google Scholar
  37. Yiqiu Wang, Anshumali Shrivastava, Jonathan Wang, and Junghee Ryu. Randomized algorithms accelerated over cpu-gpu for ultra-high dimensional similarity search. In Proc. International Conference on Management of Data (SIGMOD), pages 889-903. ACM, 2018. Google Scholar
  38. W John Wilbur, Won Kim, and Natalie Xie. Spelling correction in the pubmed search engine. Information retrieval, 9(5):543-564, 2006. Google Scholar
  39. Haoyu Zhang and Qin Zhang. Embedjoin: Efficient edit similarity joins via embeddings. In Proc. 23rd International Conference on Knowledge Discovery and Data Mining (KDD), pages 585-594. ACM, 2017. Google Scholar
  40. Haoyu Zhang and Qin Zhang. Minjoin: Efficient edit similarity joins via local hash minima. In Proc. 25th International Conference on Knowledge Discovery and Data Mining (KDD), pages 1093-1103. ACM, 2019. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail