On Estimating Edit Distance: Alignment, Dimension Reduction, and Embeddings

Authors Moses Charikar, Ofir Geri, Michael P. Kim, William Kuszmaul



PDF
Thumbnail PDF

File

LIPIcs.ICALP.2018.34.pdf
  • Filesize: 483 kB
  • 14 pages

Document Identifiers

Author Details

Moses Charikar
  • Department of Computer Science, Stanford University, Stanford, CA, USA
Ofir Geri
  • Department of Computer Science, Stanford University, Stanford, CA, USA
Michael P. Kim
  • Department of Computer Science, Stanford University, Stanford, CA, USA
William Kuszmaul
  • Department of Computer Science, Stanford University, Stanford, CA, USA

Cite As Get BibTex

Moses Charikar, Ofir Geri, Michael P. Kim, and William Kuszmaul. On Estimating Edit Distance: Alignment, Dimension Reduction, and Embeddings. In 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 107, pp. 34:1-34:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018) https://doi.org/10.4230/LIPIcs.ICALP.2018.34

Abstract

Edit distance is a fundamental measure of distance between strings and has been widely studied in computer science. While the problem of estimating edit distance has been studied extensively, the equally important question of actually producing an alignment (i.e., the sequence of edits) has received far less attention. Somewhat surprisingly, we show that any algorithm to estimate edit distance can be used in a black-box fashion to produce an approximate alignment of strings, with modest loss in approximation factor and small loss in run time. Plugging in the result of Andoni, Krauthgamer, and Onak, we obtain an alignment that is a (log n)^{O(1/epsilon^2)} approximation in time O~(n^{1 + epsilon}).
Closely related to the study of approximation algorithms is the study of metric embeddings for edit distance. We show that min-hash techniques can be useful in designing edit distance embeddings through three results: (1) An embedding from Ulam distance (edit distance over permutations) to Hamming space that matches the best known distortion of O(log n) and also implicitly encodes a sequence of edits between the strings; (2) In the case where the edit distance between the input strings is known to have an upper bound K, we show that embeddings of edit distance into Hamming space with distortion f(n) can be modified in a black-box fashion to give distortion O(f(poly(K))) for a class of periodic-free strings; (3) A randomized dimension-reduction map with contraction c and asymptotically optimal expected distortion O(c), improving on the previous O~(c^{1 + 2 / log log log n}) distortion result of Batu, Ergun, and Sahinalp.

Subject Classification

ACM Subject Classification
  • Theory of computation → Approximation algorithms analysis
Keywords
  • edit distance
  • alignment
  • approximation algorithms
  • embedding
  • dimension reduction

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman. Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic acids research, 25(17):3389-3402, 1997. Google Scholar
  2. Alexandr Andoni and Robert Krauthgamer. The computational hardness of estimating edit distance. SIAM J. Comput., 39(6):2398-2429, 2010. Google Scholar
  3. Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. Polylogarithmic approximation for edit distance and the asymmetric query complexity. In Proceedings of the 51st Annual Symposium on Foundations of Computer Science (FOCS), pages 377-386. IEEE, 2010. Google Scholar
  4. Alexandr Andoni and Krzysztof Onak. Approximating edit distance in near-linear time. SIAM J. Comput., 41(6):1635-1648, 2012. Google Scholar
  5. Arturs Backurs and Piotr Indyk. Edit distance cannot be computed in strongly subquadratic time (unless seth is false). In Proceedings of the 47th Annual Symposium on Theory of Computing (STOC), pages 51-58. ACM, 2015. Google Scholar
  6. Ziv Bar-Yossef, T.S. Jayram, Robert Krauthgamer, and Ravi Kumar. Approximating edit distance efficiently. In Proceedings of 45th Annual Symposium on Foundations of Computer Science (FOCS), pages 550-559. IEEE, 2004. Google Scholar
  7. Tugkan Batu, Funda Ergün, and Süleyman Cenk Sahinalp. Oblivious string embeddings and edit distance approximations. In Proceedings of the 17th Annual Symposium on Discrete Algorithms (SODA), pages 792-801, 2006. Google Scholar
  8. Djamal Belazzougui and Qin Zhang. Edit distance: Sketching, streaming, and document exchange. In Proceedings of the 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 51-60. IEEE, 2016. Google Scholar
  9. Diptarka Chakraborty, Elazar Goldenberg, and Michal Kouckỳ. Streaming algorithms for embedding and computing edit distance in the low distance regime. In Proceedings of the 48th Annual Symposium on Theory of Computing (STOC), pages 712-725. ACM, 2016. Google Scholar
  10. Kun-Mao Chao, William R. Pearson, and Webb Miller. Aligning two sequences within a specified diagonal band. Bioinformatics, 8(5):481-487, 1992. Google Scholar
  11. Moses Charikar and Robert Krauthgamer. Embedding the ulam metric into l_1. Theory of Computing, 2(11):207-224, 2006. Google Scholar
  12. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to algorithms. The MIT Press, third edition, 2009. Google Scholar
  13. Johannes Fischer and Volker Heun. Theoretical and practical improvements on the rmq-problem, with applications to LCA and LCE. In Proceedings of the 17th Annual Symposium on Combinatorial Pattern Matching (CPM), pages 36-48, 2006. Google Scholar
  14. Dan Gusfield. Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge University Press, 1997. Google Scholar
  15. Piotr Indyk. Algorithmic aspects of geometric embeddings (tutorial). In Proceedings of the 42nd Annual Symposium on Foundations of Computer Science (FOCS), pages 10-33. IEEE, 2001. Google Scholar
  16. Piotr Indyk and Jiri Matoušek. Low-distortion embeddings of finite metric spaces. Handbook of Discrete and Computational Geometry, page 177, 2004. Google Scholar
  17. Robert Krauthgamer and Yuval Rabani. Improved lower bounds for embeddings into L_1. In Proceedings of the 17th Annual Symposium on Discrete Algorithms (SODA), pages 1010-1017, 2006. Google Scholar
  18. Gad M. Landau, Eugene W. Myers, and Jeanette P. Schmidt. Incremental string comparison. SIAM Journal on Computing, 27(2):557-582, 1998. Google Scholar
  19. Gonzalo Navarro. A guided tour to approximate string matching. ACM computing surveys (CSUR), 33(1):31-88, 2001. Google Scholar
  20. Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of molecular biology, 48(3):443-453, 1970. Google Scholar
  21. Rafail Ostrovsky and Yuval Rabani. Low distortion embeddings for edit distance. J. ACM, 54(5):23, 2007. Google Scholar
  22. Anna Pagh and Rasmus Pagh. Uniform hashing in constant time and optimal space. SIAM Journal on Computing, 38(1):85-96, 2008. Google Scholar
  23. Taras K. Vintsyuk. Speech discrimination by dynamic programming. Cybernetics, 4(1):52-57, 1968. Google Scholar
  24. Robert A. Wagner and Michael J. Fischer. The string-to-string correction problem. J. ACM, 21(1):168-173, 1974. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail