An Algorithmic Bridge Between Hamming and Levenshtein Distances

Authors Elazar Goldenberg, Tomasz Kociumaka, Robert Krauthgamer, Barna Saha




File

LIPIcs.ITCS.2023.58.pdf
  • Filesize: 0.96 MB
  • 23 pages

Author Details

Elazar Goldenberg
  • Academic College of Tel Aviv-Yafo, Israel
Tomasz Kociumaka
  • Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
Robert Krauthgamer
  • Weizmann Institute of Science, Rehovot, Israel
Barna Saha
  • University of California, San Diego, CA, USA

Cite As

Elazar Goldenberg, Tomasz Kociumaka, Robert Krauthgamer, and Barna Saha. An Algorithmic Bridge Between Hamming and Levenshtein Distances. In 14th Innovations in Theoretical Computer Science Conference (ITCS 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 251, pp. 58:1-58:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.ITCS.2023.58

Abstract

The edit distance between strings classically assigns unit cost to every character insertion, deletion, and substitution, whereas the Hamming distance only allows substitutions. In many real-life scenarios, insertions and deletions (abbreviated indels) appear frequently but significantly less so than substitutions. To model this, we make substitutions cheaper than indels: each substitution costs 1/a for a parameter a ≥ 1, while each indel keeps unit cost. This basic variant, denoted ED_a, bridges classical edit distance (a = 1) with Hamming distance (a → ∞), leading to interesting algorithmic challenges: Does the time complexity of computing ED_a interpolate between that of Hamming distance (linear time) and edit distance (quadratic time)? What about approximating ED_a? We first present a simple deterministic exact algorithm for ED_a and further prove that it is near-optimal assuming the Orthogonal Vectors Conjecture. Our main result is a randomized algorithm computing a (1+ε)-approximation of ED_a(X,Y), given strings X,Y of total length n and a bound k ≥ ED_a(X,Y). For simplicity, let us focus on k ≥ 1 and a constant ε > 0; then, our algorithm takes Õ(n/a + ak³) time. This running time is sublinear in n in the most interesting regime of small enough k, except when a = Õ(1), in which case ED_a resembles the standard edit distance. We also consider a very natural version that asks to find a (k_I, k_S)-alignment, i.e., an alignment with at most k_I indels and k_S substitutions. In this setting, we give an exact algorithm and, more importantly, an Õ((nk_I)/k_S + k_S k_I³)-time (1,1+ε)-bicriteria approximation algorithm. The latter solution is based on the techniques we develop for ED_a for a = Θ(k_S/k_I), and its running time is again sublinear in n whenever k_I ≪ k_S and the overall distance is small enough. These bounds are in stark contrast to unit-cost edit distance, where state-of-the-art algorithms are far from achieving (1+ε)-approximation in sublinear time, even for a favorable choice of k.
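
To make the definition of ED_a concrete, the following is a minimal sketch of the textbook quadratic-time dynamic program, adapted so that an indel costs 1 and a substitution costs 1/a. It is not one of the algorithms from the paper (neither the exact algorithm nor the sublinear-time approximation); it only illustrates the quantity being computed. The function name `weighted_edit_distance` is hypothetical, and restricting a to a positive integer is an assumption made here so that all arithmetic stays exact.

```python
from fractions import Fraction


def weighted_edit_distance(x: str, y: str, a: int) -> Fraction:
    """Return ED_a(x, y): indels cost 1, substitutions cost 1/a.

    Illustrative O(|x| * |y|) dynamic program, not the paper's algorithm.
    Assumption: `a` is a positive integer (the paper allows any real a >= 1).
    """
    if a < 1:
        raise ValueError("a must be at least 1")

    # Work in units of 1/a so every cost is an integer:
    # an indel costs a units and a substitution costs 1 unit.
    prev = [j * a for j in range(len(y) + 1)]  # row for the empty prefix of x
    for i, cx in enumerate(x, start=1):
        cur = [i * a]  # deleting the first i characters of x
        for j, cy in enumerate(y, start=1):
            match_or_sub = prev[j - 1] + (0 if cx == cy else 1)
            delete = prev[j] + a
            insert = cur[j - 1] + a
            cur.append(min(match_or_sub, delete, insert))
        prev = cur

    # Convert back from units of 1/a to the original cost scale.
    return Fraction(prev[-1], a)
```

With a = 1 this coincides with the classical edit distance, for example `weighted_edit_distance("kitten", "sitting", 1)` evaluates to 3. For equal-length strings and large a, indels become too expensive to use and the value is the Hamming distance divided by a, for example `weighted_edit_distance("karolin", "kathrin", 100)` evaluates to `Fraction(3, 100)`.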

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
  • Theory of computation → Streaming, sublinear and near linear time algorithms
Keywords
  • edit distance
  • Hamming distance
  • Longest Common Extension queries
