A Succinct Four Russians Speedup for Edit Distance Computation and One-against-many Banded Alignment

Authors Brian Brubach , Jay Ghurye



PDF
Thumbnail PDF

File

LIPIcs.CPM.2018.13.pdf
  • Filesize: 452 kB
  • 12 pages

Document Identifiers

Author Details

Brian Brubach
  • Department of Computer Science, University of Maryland, College Park, MD 20742, USA
Jay Ghurye
  • Department of Computer Science, University of Maryland, College Park, MD 20742, USA

Cite As Get BibTex

Brian Brubach and Jay Ghurye. A Succinct Four Russians Speedup for Edit Distance Computation and One-against-many Banded Alignment. In 29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 105, pp. 13:1-13:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018) https://doi.org/10.4230/LIPIcs.CPM.2018.13

Abstract

The classical Four Russians speedup for computing edit distance (a.k.a. Levenshtein distance), due to Masek and Paterson [Masek and Paterson, 1980], involves partitioning the dynamic programming table into k-by-k square blocks and generating a lookup table in O(psi^{2k} k^2 |Sigma|^{2k}) time and O(psi^{2k} k |Sigma|^{2k}) space for block size k, where psi depends on the cost function (for unit costs psi = 3) and |Sigma| is the size of the alphabet. We show that the O(psi^{2k} k^2) and O(psi^{2k} k) factors can be improved to O(k^2 lg{k}) time and O(k^2) space. Thus, we improve the time and space complexity of that aspect compared to Masek and Paterson [Masek and Paterson, 1980] and remove the dependence on psi.
We further show that for certain problems the O(|Sigma|^{2k}) factor can also be reduced. Using this technique, we show a new algorithm for the fundamental problem of one-against-many banded alignment. In particular, comparing one string of length m to n other strings of length m with maximum distance d can be performed in O(n m + m d^2 lg{d} + n d^3) time. When d is reasonably small, this approaches or meets the current best theoretic result of O(nm + n d^2) achieved by using the best known pairwise algorithm running in O(m + d^2) time [Myers, 1986][Ukkonen, 1985] while potentially being more practical. It also improves on the standard practical approach which requires O(n m d) time to iteratively run an O(md) time pairwise banded alignment algorithm.
Regarding pairwise comparison, we extend the classic result of Masek and Paterson [Masek and Paterson, 1980] which computes the edit distance between two strings in O(m^2/log{m}) time to remove the dependence on psi even when edits have arbitrary costs from a penalty matrix. Crochemore, Landau, and Ziv-Ukelson [Crochemore, 2003] achieved a similar result, also allowing for unrestricted scoring matrices, but with variable-sized blocks. In practical applications of the Four Russians speedup wherein space efficiency is important and smaller block sizes k are used (notably k < |Sigma|), Kim, Na, Park, and Sim [Kim et al., 2016] showed how to remove the dependence on the alphabet size for the unit cost version, generating a lookup table in O(3^{2k} (2k)! k^2) time and O(3^{2k} (2k)! k) space. Combining their work with our result yields an improvement to O((2k)! k^2 lg{k}) time and O((2k)! k^2) space.

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • edit distance
  • banded alignment
  • one-against-many alignment
  • genomics
  • method of the Four Russians

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Aggarwal and J. Park. Notes on searching in multidimensional monotone arrays. In [Proceedings 1988] 29th Annual Symposium on Foundations of Computer Science, pages 497-512, Oct 1988. Google Scholar
  2. Alok Aggarwal, Maria M. Klawe, Shlomo Moran, Peter Shor, and Robert Wilber. Geometric applications of a matrix-searching algorithm. Algorithmica, 2(1):195-208, Nov 1987. Google Scholar
  3. Arturs Backurs and Piotr Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). In Proceedings of the Forty-seventh Annual ACM Symposium on Theory of Computing, STOC '15, pages 51-58, New York, NY, USA, 2015. ACM. Google Scholar
  4. Gregory Bard. Matrix inversion (or lup-factorization) via the method of four russians, in θ(n³/log n) time. LMS J. Comput. Math, 1:14, 2008. Google Scholar
  5. Karl Bringmann and Marvin Kunnemann. Quadratic conditional lower bounds for string problems and dynamic time warping. In Proceedings of the 2015 IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS), FOCS '15, pages 79-97, Washington, DC, USA, 2015. IEEE Computer Society. Google Scholar
  6. Brian Brubach, Jay Ghurye, Mihai Pop, and Aravind Srinivasan. Better greedy sequence clustering with fast banded alignment. In Russell Schwartz and Knut Reinert, editors, 17th International Workshop on Algorithms in Bioinformatics (WABI 2017), volume 88 of Leibniz International Proceedings in Informatics (LIPIcs), pages 3:1-3:13, Dagstuhl, Germany, 2017. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. Google Scholar
  7. J Gregory Caporaso, Christian L Lauber, William A Walters, Donna Berg-Lyons, James Huntley, Noah Fierer, Sarah M Owens, Jason Betley, Louise Fraser, Markus Bauer, et al. Ultra-high-throughput microbial community analysis on the illumina hiseq and miseq platforms. The ISME journal, 6(8):1621-1624, 2012. Google Scholar
  8. Maxime Crochemore, Gad M. Landau, and Michal Ziv-Ukelson. A subquadratic sequence alignment algorithm for unrestricted scoring matrices. SIAM J. Comput., 32(6):1654-1673, 2003. Google Scholar
  9. J. W. Fickett. Fast optimal alignment. Nucleic Acids Res., 12(1 Pt 1):175-179, Jan 1984. Google Scholar
  10. Yelena Frid and Dan Gusfield. A simple, practical and complete o-time algorithm for rna folding using the four-russians speedup. Algorithms for Molecular Biology, 5(1):13, 2010. Google Scholar
  11. Dan Gusfield. Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge university press, 1997. Google Scholar
  12. Ahmed Hassan, Sara Noeman, and Hany Hassan. Language independent text correction using finite state automata. In In Proceedings of the Third International Joint Conference on Natural Language Processing, pages 913-918, 2008. Google Scholar
  13. Youngho Kim, Joong Chae Na, Heejin Park, and Jeong Seop Sim. A space-efficient alphabet-independent four-russians' lookup table and a multithreaded four-russians' edit distance algorithm. Theor. Comput. Sci., 656:173-179, 2016. URL: http://dx.doi.org/10.1016/j.tcs.2016.04.028.
  14. William J. Masek and Michael S. Paterson. How to compute string-edit distances quickly. In Time Warps, String Edits, and Macromolecules: the Theory and Practice of Sequence Comparison, pages 337-349. Addison-Wesley Publ. Co., Mass., 1983. Google Scholar
  15. William J. Masek and Mike Paterson. A faster algorithm computing string edit distances. J. Comput. Syst. Sci., 20(1):18-31, 1980. URL: http://dx.doi.org/10.1016/0022-0000(80)90002-1.
  16. Stoyan Mihov and Klaus U. Schulz. Fast approximate search in large dictionaries. Comput. Linguist., 30(4):451-477, 2004. Google Scholar
  17. Eugene W. Myers. An O(ND) difference algorithm and its variations. Algorithmica, 1(1):251-266, Nov 1986. Google Scholar
  18. Gene Myers. A four russians algorithm for regular expression pattern matching. Journal of the ACM (JACM), 39(2):432-448, 1992. Google Scholar
  19. Jeanette P. Schmidt. All highest scoring paths in weighted grid graphs and their application to finding all approximate repeats in strings. SIAM J. Comput., 27(4):972-992, 1998. Google Scholar
  20. Claus-Peter Schnorr. An algorithm for transitive closure with linear expected time. SIAM Journal on Computing, 7(2):127-133, 1978. Google Scholar
  21. Klaus Schulz and Stoyan Mihov. Fast string correction with levenshtein-automata. International Journal of Document Analysis and Recognition, 5:67-85, 2002. Google Scholar
  22. Esko Ukkonen. Algorithms for approximate string matching. Inf. Control, 64(1-3):100-118, 1985. Google Scholar
  23. Sun Wu, Udi Manber, and Eugene Myers. A subquadratic algorithm for approximate regular expression matching. Journal of algorithms, 19(3):346-360, 1995. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail