Beyond the Longest Letter-Duplicated Subsequence Problem

Authors Wenfeng Lai, Adiesha Liyanage, Binhai Zhu, Peng Zou





Author Details

Wenfeng Lai
  • College of Computer Science and Technology, Shandong University, Qingdao, China
Adiesha Liyanage
  • Gianforte School of Computing, Montana State University, Bozeman, MT, USA
Binhai Zhu
  • Gianforte School of Computing, Montana State University, Bozeman, MT, USA
Peng Zou
  • Gianforte School of Computing, Montana State University, Bozeman, MT, USA

Cite As

Wenfeng Lai, Adiesha Liyanage, Binhai Zhu, and Peng Zou. Beyond the Longest Letter-Duplicated Subsequence Problem. In 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 223, pp. 7:1-7:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.CPM.2022.7

Abstract

Motivated by computing duplication patterns in sequences, a new fundamental problem called the longest letter-duplicated subsequence (LLDS) is proposed. Given a sequence S of length n, a letter-duplicated subsequence (LD-subsequence) is a subsequence of S of the form x₁^{d₁}x₂^{d₂}⋯ x_k^{d_k} with x_i ∈ Σ, x_j ≠ x_{j+1} and d_i ≥ 2 for all i in [k] and j in [k-1]. A linear-time algorithm for computing a longest letter-duplicated subsequence of S can be easily obtained. In this paper, we focus on two variants of this problem. We first consider the constrained version in which Σ is unbounded, each letter appears in S at least 6 times and all the letters in Σ must appear in the solution. We show that the problem is NP-hard (a further twist indicates that the problem does not admit any polynomial-time approximation). The reduction is from possibly the simplest version of SAT that is NP-complete, (≤ 2,1, ≤ 3)-SAT, where each variable appears at most twice positively and exactly once negatively, each clause contains at most three literals, and some clauses must contain exactly two literals. (We hope that this technique will serve as a general tool for proving NP-hardness of trickier sequence problems involving only one sequence, which is much harder than proving it for problems with at least two input sequences; we apply it successfully at the end of the paper to some extra variations of the LLDS problem.) We then show that when each letter appears in S at most 3 times, the problem admits a factor 1.5-O(1/n) approximation. Finally, we consider the weighted version, where the weight of a block x_i^{d_i} (d_i ≥ 2) could be any positive function, which might not grow with d_i. We give a non-trivial O(n²) time dynamic programming algorithm for this version, i.e., computing an LD-subsequence of S whose weight is maximized.
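To make the definition concrete, the following Python sketch checks whether a string has the letter-duplicated form and computes the length of a longest letter-duplicated subsequence with a simple one-pass dynamic program. It is an illustration only and is not taken from the paper: the function names `is_letter_duplicated` and `llds_length` and the recurrence are assumptions of this sketch, and the authors' linear-time algorithm may be organized differently. The sketch relies on the observation that two adjacent blocks of the same letter can be merged into one longer block, so the condition x_j ≠ x_{j+1} does not restrict the achievable length in the unweighted, unconstrained problem.

```python
from itertools import groupby


def is_letter_duplicated(t: str) -> bool:
    """Return True if t has the form x1^d1 x2^d2 ... xk^dk with every d_i >= 2.

    groupby splits t into maximal runs, so consecutive runs always use
    different letters (x_j != x_{j+1}); only the run lengths need checking.
    """
    return all(len(list(run)) >= 2 for _, run in groupby(t))


def llds_length(s: str) -> int:
    """Length of a longest letter-duplicated subsequence of s (illustrative DP).

    best           : optimum over the prefix read so far
    before_last[c] : optimum over the prefix that ends just before the most
                     recent occurrence of letter c
    ending[c]      : best length of a solution whose last block consists of
                     c's and ends exactly at the most recent occurrence of c
    """
    best = 0
    before_last = {}
    ending = {}
    for c in s:
        cand = None
        if c in before_last:
            # Start a new block from the previous occurrence of c and this one.
            cand = before_last[c] + 2
            if c in ending:
                # Or extend a block that already ended at the previous occurrence.
                cand = max(cand, ending[c] + 1)
        # Record the optimum strictly before this occurrence of c.
        before_last[c] = best
        if cand is not None:
            ending[c] = cand
            best = max(best, cand)
    return best
```

For example, `llds_length("aabbaa")` returns 6 because the whole string is already letter-duplicated, while `llds_length("abab")` returns 2 because only one of the two letters can form a block. With hash-map lookups treated as constant time, the loop does a constant amount of work per character.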

Subject Classification

ACM Subject Classification
  • Theory of computation
Keywords
  • Segmental duplications
  • Tandem duplications
  • Longest common subsequence
  • NP-completeness
  • Dynamic programming
