Approximating LCS and Alignment Distance over Multiple Sequences

Authors Debarati Das, Barna Saha



PDF
Thumbnail PDF

File

LIPIcs.APPROX-RANDOM.2022.54.pdf
  • Filesize: 1.8 MB
  • 21 pages

Document Identifiers

Author Details

Debarati Das
  • Pennsylvania state University, University Park, PA, USA
Barna Saha
  • University of California, San Diego, CA, USA

Cite AsGet BibTex

Debarati Das and Barna Saha. Approximating LCS and Alignment Distance over Multiple Sequences. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 245, pp. 54:1-54:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.APPROX/RANDOM.2022.54

Abstract

We study the problem of aligning multiple sequences with the goal of finding an alignment that either maximizes the number of aligned symbols (the longest common subsequence (LCS) problem), or minimizes the number of unaligned symbols (the alignment distance aka the complement of LCS). Multiple sequence alignment is a well-studied problem in bioinformatics and is used routinely to identify regions of similarity among DNA, RNA, or protein sequences to detect functional, structural, or evolutionary relationships among them. It is known that exact computation of LCS or alignment distance of m sequences each of length n requires Θ(n^m) time unless the Strong Exponential Time Hypothesis is false. However, unlike the case of two strings, fast algorithms to approximate LCS and alignment distance of multiple sequences are lacking in the literature. A major challenge in this area is to break the triangle inequality. Specifically, by splitting m sequences into two (roughly) equal sized groups, then computing the alignment distance in each group and finally combining them by using triangle inequality, it is possible to achieve a 2-approximation in Õ_m(n^⌈m/2⌉) time. But, an approximation factor below 2 which would need breaking the triangle inequality barrier is not known in O(n^{α m}) time for any α < 1. We make significant progress in this direction. First, we consider a semi-random model where, we show if just one out of m sequences is (p,B)-pseudorandom then, we can get a below-two approximation in Õ_m(nB^{m-1}+n^{⌊m/2⌋+3}) time. Such semi-random models are very well-studied for two strings scenario, however directly extending those works require one but all sequences to be pseudorandom, and would only give an O(1/p) approximation. We overcome these with significant new ideas. Specifically an ingredient to this proof is a new algorithm that achives below 2 approximations when alignment distance is large in Õ_m(n^{⌊m/2⌋+2}) time. This could be of independent interest. Next, for LCS of m sequences each of length n, we show if the optimum LCS is λ n for some λ ∈ [0,1], then in Õ_m(n^{⌊m/2⌋+1}) time, we can return a common subsequence of length at least λ²n/(2+ε) for any arbitrary constant ε > 0. In contrast, for two strings, the best known subquadratic algorithm may return a common subsequence of length Θ(λ⁴ n).

Subject Classification

ACM Subject Classification
  • Theory of computation → Approximation algorithms analysis
Keywords
  • String Algorithms
  • Approximation Algorithms

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Amir Abboud, Arturs Backurs, and Virginia Vassilevska Williams. Tight hardness results for LCS and other sequence similarity measures. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 59-78, 2015. Google Scholar
  2. Alexandr Andoni and Robert Krauthgamer. The smoothed complexity of edit distance. ACM Transactions on Algorithms (TALG), 8(4):1-25, 2012. Google Scholar
  3. Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. Polylogarithmic approximation for edit distance and the asymmetric query complexity. In 51th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2010, pages 377-386, 2010. Google Scholar
  4. Alexandr Andoni and Negev Shekel Nosatzki. Edit distance in near-linear time: it’s a constant factor. In 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020, 2020. Google Scholar
  5. Alexandr Andoni and Krzysztof Onak. Approximating edit distance in near-linear time. SIAM J. Comput., 41(6):1635-1648, 2012. Google Scholar
  6. Ziv Bar-Yossef, T. S. Jayram, Robert Krauthgamer, and Ravi Kumar. Approximating edit distance efficiently. In 45th Symposium on Foundations of Computer Science, FOCS 2004, pages 550-559, 2004. Google Scholar
  7. Tugkan Batu, Funda Ergün, Joe Kilian, Avner Magen, Sofya Raskhodnikova, Ronitt Rubinfeld, and Rahul Sami. A sublinear algorithm for weakly approximating edit distance. In Proceedings of the 35th Annual ACM Symposium on Theory of Computing, pages 316-324, 2003. Google Scholar
  8. Tugkan Batu, Funda Ergün, and Süleyman Cenk Sahinalp. Oblivious string embeddings and edit distance approximations. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2006, pages 792-801, 2006. Google Scholar
  9. Amey Bhangale, Diptarka Chakraborty, and Rajendra Kumar. Hardness of approximation of (multi-)lcs over small alphabet. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2020, volume 176, pages 38:1-38:16, 2020. Google Scholar
  10. Guillaume Blin, Laurent Bulteau, Minghui Jiang, Pedro J Tejada, and Stéphane Vialette. Hardness of longest common subsequence for sequences with bounded run-lengths. In Annual Symposium on Combinatorial Pattern Matching, pages 138-148, 2012. Google Scholar
  11. Mahdi Boroujeni, Soheil Ehsani, Mohammad Ghodsi, Mohammad Taghi Hajiaghayi, and Saeed Seddighin. Approximating edit distance in truly subquadratic time: Quantum and mapreduce. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, pages 1170-1189, 2018. Google Scholar
  12. Mahdi Boroujeni, Masoud Seddighin, and Saeed Seddighin. Improved algorithms for edit distance and LCS: beyond worst case. In Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, pages 1601-1620, 2020. Google Scholar
  13. Joshua Brakensiek and Aviad Rubinstein. Constant-factor approximation of near-linear edit distance in near-linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 685-698, 2020. Google Scholar
  14. Diptarka Chakraborty, Debarati Das, Elazar Goldenberg, Michal Koucký, and Michael E. Saks. Approximating edit distance within constant factor in truly sub-quadratic time. In 59th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2018, pages 979-990, 2018. Google Scholar
  15. Elazar Goldenberg, Aviad Rubinstein, and Barna Saha. Does preprocessing help in fast sequence comparisons? In Proccedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, pages 657-670, 2020. Google Scholar
  16. Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997. Google Scholar
  17. Daniel S. Hirschberg. Algorithms for the longest common subsequence problem. J. ACM, 24(4):664-675, 1977. Google Scholar
  18. Tao Jiang and Ming Li. On the approximation of shortest common supersequences and longest common subsequences. SIAM Journal on Computing, 24(5):1122-1139, 1995. Google Scholar
  19. Michal Koucký and Michael E. Saks. Constant factor approximations to edit distance on far input pairs in nearly linear time. In Proccedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, pages 699-712, 2020. Google Scholar
  20. William Kuszmaul. Efficiently approximating edit distance between pseudorandom strings. In Proceedings of the thirtieth annual ACM-SIAM Symposium on Discrete Algorithms, pages 1165-1180. SIAM, 2019. Google Scholar
  21. Gad M. Landau, Eugene W. Myers, and Jeanette P. Schmidt. Incremental string comparison. SIAM J. Comput., 27(2):557-582, 1998. Google Scholar
  22. Gad M. Landau and Uzi Vishkin. Fast string matching with k differences. J. Comput. Syst. Sci., 37(1):63-78, 1988. Google Scholar
  23. David Maier. The complexity of some problems on subsequences and supersequences. Journal of the ACM (JACM), 25(2):322-336, 1978. Google Scholar
  24. Eugene W. Myers. An O(ND) difference algorithm and its variations. Algorithmica, 1(2):251-266, 1986. Google Scholar
  25. François Nicolas and Eric Rivals. Hardness results for the center and median string problems under the weighted and unweighted edit distances. J. Discrete Algorithms, 3(2-4):390-415, 2005. Google Scholar
  26. Pavel A Pevzner. Multiple alignment, communication cost, and graph matching. SIAM Journal on Applied Mathematics, 52(6):1763-1779, 1992. Google Scholar
  27. Aviad Rubinstein. Approximating edit distance, 2018. Google Scholar
  28. Aviad Rubinstein, Saeed Seddighin, Zhao Song, and Xiaorui Sun. Approximation algorithms for LCS and LIS with truly improved running times. In 60th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2019, pages 1121-1145, 2019. Google Scholar
  29. Julie D Thompson, Desmond G Higgins, and Toby J Gibson. Clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic acids research, 22:4673-4680, 1994. Google Scholar
  30. Esko Ukkonen. Algorithms for approximate string matching. Information and Control, 64(1-3):100-118, 1985. Google Scholar
  31. Nuzzo R. Van Noorden R, Maher B. The top 100 papers. Nature, 2014. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail