Near-Linear Time Edit Distance for Indel Channels

Authors Arun Ganesh, Aaron Sy



File

LIPIcs.WABI.2020.17.pdf
  • Filesize: 0.56 MB
  • 18 pages

Author Details

Arun Ganesh
  • Department of Electrical Engineering and Computer Sciences, UC Berkeley, CA, USA
Aaron Sy
  • Department of Electrical Engineering and Computer Sciences, UC Berkeley, CA, USA

Acknowledgements

We thank Satish Rao for suggesting the problem and for pointing out the connection to alignment heuristics used in practice. We thank Nir Yosef for helpful discussions on models for indels used in computational biology. We thank the anonymous reviewers for their helpful feedback regarding the presentation of the results.

Cite As

Arun Ganesh and Aaron Sy. Near-Linear Time Edit Distance for Indel Channels. In 20th International Workshop on Algorithms in Bioinformatics (WABI 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 172, pp. 17:1-17:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.WABI.2020.17

Abstract

We consider the following model for sampling pairs of strings: s₁ is a uniformly random bitstring of length n, and s₂ is the bitstring obtained by applying substitutions, insertions, and deletions to each bit of s₁ with some probability. We show that the edit distance between s₁ and s₂ can be computed in O(n ln n) time with high probability, as long as each bit of s₁ has a mutation applied to it with probability at most a small constant. The algorithm is simple and uses only the textbook dynamic programming algorithm as a primitive: it first computes an approximate alignment between the two strings, and then runs the dynamic programming algorithm restricted to entries close to the approximate alignment. The analysis of our algorithm provides theoretical justification for alignment heuristics used in practice such as BLAST, FASTA, and MAFFT, which also start by quickly computing an approximate alignment and then find the best alignment near it. Our main technical contribution is a partitioning of alignments such that the number of subsets in the partition is not too large and, with high probability, every alignment in one subset is worse than an alignment considered by our algorithm. Similar techniques may be of interest in the average-case analysis of other problems commonly solved via dynamic programming.
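The two ingredients named in the abstract, the sampling model and a dynamic program restricted to entries near an approximate alignment, can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the function names (sample_pair, edit_distance, banded_edit_distance) and the parameters center and width are hypothetical, the channel details (at most one mutation per bit, an inserted bit placed after the original) are assumptions, and the step that computes the approximate alignment is not reproduced. The caller supplies the approximate alignment as a list center, where center[i] is the column near which row i of the table is expected to align.

```python
import random


def sample_pair(n, p_sub, p_ins, p_del, seed=0):
    """Sample (s1, s2): s1 is uniform random bits, s2 is s1 sent through an
    assumed indel channel applying at most one mutation per bit."""
    rng = random.Random(seed)
    s1 = [rng.randint(0, 1) for _ in range(n)]
    s2 = []
    for bit in s1:
        r = rng.random()
        if r < p_sub:
            s2.append(1 - bit)            # substitution: flip the bit
        elif r < p_sub + p_ins:
            s2.append(bit)                # insertion: keep the bit ...
            s2.append(rng.randint(0, 1))  # ... and add a random bit after it
        elif r < p_sub + p_ins + p_del:
            pass                          # deletion: drop the bit
        else:
            s2.append(bit)                # no mutation
    return s1, s2


def edit_distance(a, b):
    """Textbook dynamic program over the full (len(a)+1) x (len(b)+1) table."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cur[j] = min(prev[j] + 1,                            # delete a[i-1]
                         cur[j - 1] + 1,                         # insert b[j-1]
                         prev[j - 1] + (a[i - 1] != b[j - 1]))   # match or substitute
        prev = cur
    return prev[len(b)]


def banded_edit_distance(a, b, center, width):
    """Same recurrence, evaluated only at cells (i, j) with
    |j - center[i]| <= width. Returns an upper bound on the edit distance,
    or infinity if the band does not connect the two corners of the table."""
    inf = float("inf")
    prev = {}
    for i in range(len(a) + 1):
        lo = max(0, center[i] - width)
        hi = min(len(b), center[i] + width)
        cur = {}
        for j in range(lo, hi + 1):
            if i == 0 and j == 0:
                cur[j] = 0
                continue
            best = inf
            if i > 0:
                best = min(best, prev.get(j, inf) + 1)
            if j > 0:
                best = min(best, cur.get(j - 1, inf) + 1)
            if i > 0 and j > 0:
                best = min(best, prev.get(j - 1, inf) + (a[i - 1] != b[j - 1]))
            cur[j] = best
        prev = cur
    return prev.get(len(b), inf)
```

With center[i] = i (the main diagonal) the banded routine is exact whenever width is at least the true edit distance, because an alignment with d edits never strays more than d cells from the diagonal. Under an indel channel the true alignment drifts away from the diagonal as n grows, which is why a narrow band needs to follow a better approximate alignment than the diagonal; the sketch leaves that choice to the caller. The work done is proportional to (len(a) + 1) times (2 * width + 1) rather than to the size of the full table.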

Subject Classification

ACM Subject Classification
  • Theory of computation → Dynamic programming
Keywords
  • edit distance
  • average-case analysis
  • dynamic programming
  • sequence alignment
