
# Beyond the Longest Letter-Duplicated Subsequence Problem

## File

LIPIcs.CPM.2022.7.pdf
• Filesize: 0.65 MB
• 12 pages

## Cite As

Wenfeng Lai, Adiesha Liyanage, Binhai Zhu, and Peng Zou. Beyond the Longest Letter-Duplicated Subsequence Problem. In 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 223, pp. 7:1-7:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.CPM.2022.7

## Abstract

Motivated by computing duplication patterns in sequences, a new fundamental problem called the longest letter-duplicated subsequence (LLDS) problem is proposed. Given a sequence $S$ of length $n$, a letter-duplicated subsequence (LD-subsequence) is a subsequence of $S$ of the form $x_1^{d_1} x_2^{d_2} \cdots x_k^{d_k}$ with $x_i \in \Sigma$, $x_j \neq x_{j+1}$ and $d_i \geq 2$ for all $i \in [k]$ and $j \in [k-1]$. A linear time algorithm for computing a longest letter-duplicated subsequence of $S$ can be easily obtained. In this paper, we focus on two variants of this problem. We first consider the constrained version when $\Sigma$ is unbounded, each letter appears in $S$ at least 6 times and all the letters in $\Sigma$ must appear in the solution. We show that the problem is NP-hard (a further twist indicates that the problem does not admit any polynomial time approximation). The reduction is from possibly the simplest version of SAT that is NP-complete, $(\leq 2, 1, \leq 3)$-SAT, where each variable appears at most twice positively and exactly once negatively, and each clause contains at most three literals and some clauses must contain exactly two literals. (We hope that this technique will serve as a general tool for proving NP-hardness of some trickier sequence problems involving only one sequence, which is much harder than with at least two input sequences; we apply it successfully at the end of the paper to some extra variations of the LLDS problem.) We then show that when each letter appears in $S$ at most 3 times, the problem admits a factor $1.5 - O(1/n)$ approximation. Finally, we consider the weighted version, where the weight of a block $x_i^{d_i}$ ($d_i \geq 2$) could be any positive function which might not grow with $d_i$. We give a non-trivial $O(n^2)$ time dynamic programming algorithm for this version, i.e., computing an LD-subsequence of $S$ whose weight is maximized.
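
To make the unconstrained LLDS definition concrete, the following is a minimal sketch of one possible single-pass dynamic program that returns the length of a longest letter-duplicated subsequence. It is an illustration written for this page, not necessarily the linear time algorithm the authors have in mind; the function name `llds_length` and the dictionary `open_block` are hypothetical. It relies on the observation that two adjacent blocks of the same letter can be merged into one block, so the condition that neighbouring blocks differ never reduces the optimal length.

```python
def llds_length(s: str) -> int:
    """Length of a longest letter-duplicated subsequence of s.

    Illustrative sketch only (names are assumptions, not from the paper).
    Runs in expected O(n) time using a dictionary keyed by letter.
    """
    best = 0  # best LD-subsequence length within the prefix read so far
    # open_block[c]: best length of a candidate whose last block is a run of
    # letter c ending at the latest occurrence of c; that run may still
    # contain a single copy, so the candidate is not yet counted in `best`.
    open_block: dict[str, int] = {}
    for ch in s:
        prev_best = best  # best value strictly before this position
        if ch in open_block:
            # Extending an existing run yields a block with >= 2 copies,
            # which is a valid LD-subsequence.
            extended = open_block[ch] + 1
            best = max(best, extended)
            # Either keep extending the run or start a fresh run here.
            open_block[ch] = max(extended, prev_best + 1)
        else:
            # First occurrence of ch: start a run of one copy.
            open_block[ch] = prev_best + 1
    return best
```

For example, `llds_length("aabb")` returns 4 (the whole string), while `llds_length("abab")` returns 2, since only `aa` or `bb` can be chosen. The constrained and weighted variants studied in the paper need different techniques and are not covered by this sketch.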

## Subject Classification

##### ACM Subject Classification
• Theory of computation
##### Keywords
• Segmental duplications
• Tandem duplications
• Longest common subsequence
• NP-completeness
• Dynamic programming

