Approximating LCS and Alignment Distance over Multiple Sequences

Das, Debarati; Saha, Barna

doi:10.4230/LIPIcs.APPROX/RANDOM.2022.54

Subject Classification

ACM Subject Classification

Theory of computation → Approximation algorithms analysis

Keywords

String Algorithms
Approximation Algorithms

Abstract

We study the problem of aligning multiple sequences with the goal of finding an alignment that either maximizes the number of aligned symbols (the longest common subsequence (LCS) problem) or minimizes the number of unaligned symbols (the alignment distance, i.e., the complement of LCS). Multiple sequence alignment is a well-studied problem in bioinformatics and is used routinely to identify regions of similarity among DNA, RNA, or protein sequences to detect functional, structural, or evolutionary relationships among them. It is known that exact computation of LCS or alignment distance of m sequences each of length n requires Θ(n^m) time unless the Strong Exponential Time Hypothesis is false. However, unlike the case of two strings, fast algorithms to approximate LCS and alignment distance of multiple sequences are lacking in the literature. A major challenge in this area is to break the triangle inequality. Specifically, by splitting m sequences into two (roughly) equal-sized groups, computing the alignment distance in each group, and finally combining them using the triangle inequality, it is possible to achieve a 2-approximation in Õ_m(n^⌈m/2⌉) time. But an approximation factor below 2, which would require breaking the triangle inequality barrier, is not known in O(n^{α m}) time for any α < 1. We make significant progress in this direction. First, we consider a semi-random model in which we show that if just one out of m sequences is (p,B)-pseudorandom, then we can get a below-two approximation in Õ_m(nB^{m-1}+n^{⌊m/2⌋+3}) time. Such semi-random models are very well studied for the two-string scenario; however, directly extending those works requires all but one of the sequences to be pseudorandom, and would only give an O(1/p) approximation. We overcome these limitations with significant new ideas. Specifically, an ingredient of this proof is a new algorithm that achieves a below-2 approximation when the alignment distance is large, in Õ_m(n^{⌊m/2⌋+2}) time. This could be of independent interest.
Next, for LCS of m sequences each of length n, we show that if the optimum LCS is λn for some λ ∈ [0,1], then in Õ_m(n^{⌊m/2⌋+1}) time we can return a common subsequence of length at least λ²n/(2+ε) for any constant ε > 0. In contrast, for two strings, the best known subquadratic algorithm may return a common subsequence of length Θ(λ⁴ n).
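
For orientation, the exact Θ(n^m)-time baseline that the abstract contrasts against can be sketched as the textbook dynamic program over one index per sequence. This is a minimal illustration, not the authors' algorithm: the function names are hypothetical, and the helper `alignment_distance` assumes equal-length inputs and takes the distance to be n minus the LCS length, which is one reading of "the complement of LCS" and is not spelled out in the abstract.

```python
from functools import lru_cache
from typing import Sequence, Tuple


def multi_lcs_length(seqs: Sequence[str]) -> int:
    """Exact LCS length of several sequences (textbook DP, illustrative only).

    The state holds one index per sequence, so m sequences of length n
    give O(n^m) states. Recursion depth grows with the total input length,
    so this is meant for small inputs.
    """
    m = len(seqs)
    if m == 0:
        return 0

    @lru_cache(maxsize=None)
    def solve(pos: Tuple[int, ...]) -> int:
        # Once any sequence is exhausted, no further symbol can be aligned.
        if any(pos[k] == len(seqs[k]) for k in range(m)):
            return 0
        heads = {seqs[k][pos[k]] for k in range(m)}
        if len(heads) == 1:
            # All current symbols agree, so aligning them is always safe.
            return 1 + solve(tuple(p + 1 for p in pos))
        # Otherwise at least one current symbol stays unaligned:
        # try skipping the current symbol of each sequence in turn.
        return max(
            solve(pos[:k] + (pos[k] + 1,) + pos[k + 1:]) for k in range(m)
        )

    return solve((0,) * m)


def alignment_distance(seqs: Sequence[str]) -> int:
    """Hypothetical helper. ASSUMPTION: distance = n - LCS for length-n inputs."""
    if not seqs or len({len(s) for s in seqs}) != 1:
        raise ValueError("expected a non-empty list of equal-length sequences")
    return len(seqs[0]) - multi_lcs_length(seqs)
```

For example, `multi_lcs_length(["ACGT", "AGCT", "ACCT"])` returns 3 (the common subsequence "ACT"), so under the stated assumption `alignment_distance` returns 1 for the same input. The approximation algorithms described in the abstract aim to avoid exploring this full n^m state space.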

Cite As

Debarati Das and Barna Saha. Approximating LCS and Alignment Distance over Multiple Sequences. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 245, pp. 54:1-54:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022) https://doi.org/10.4230/LIPIcs.APPROX/RANDOM.2022.54

Author Details

Debarati Das

Pennsylvania State University, University Park, PA, USA

Barna Saha

University of California, San Diego, CA, USA
