The Tandem Duplication Distance Is NP-Hard

Authors Manuel Lafond, Binhai Zhu, Peng Zou



PDF
Thumbnail PDF

File

LIPIcs.STACS.2020.15.pdf
  • Filesize: 0.51 MB
  • 15 pages

Document Identifiers

Author Details

Manuel Lafond
  • Department of Computer Science, Universite de Sherbrooke, Sherbrooke, Quebec J1K 2R1, Canada
Binhai Zhu
  • Gianforte School of Computing, Montana State University, Bozeman, MT 59717, USA
Peng Zou
  • Gianforte School of Computing, Montana State University, Bozeman, MT 59717, USA

Cite AsGet BibTex

Manuel Lafond, Binhai Zhu, and Peng Zou. The Tandem Duplication Distance Is NP-Hard. In 37th International Symposium on Theoretical Aspects of Computer Science (STACS 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 154, pp. 15:1-15:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.STACS.2020.15

Abstract

In computational biology, tandem duplication is an important biological phenomenon which can occur either at the genome or at the DNA level. A tandem duplication takes a copy of a genome segment and inserts it right after the segment - this can be represented as the string operation AXB ⇒ AXXB. Tandem exon duplications have been found in many species such as human, fly or worm, and have been largely studied in computational biology. The Tandem Duplication (TD) distance problem we investigate in this paper is defined as follows: given two strings S and T over the same alphabet, compute the smallest sequence of tandem duplications required to convert S to T. The natural question of whether the TD distance can be computed in polynomial time was posed in 2004 by Leupold et al. and had remained open, despite the fact that tandem duplications have received much attention ever since. In this paper, we prove that this problem is NP-hard, settling the 16-year old open problem. We further show that this hardness holds even if all characters of S are distinct. This is known as the exemplar TD distance, which is of special relevance in bioinformatics. One of the tools we develop for the reduction is a new problem called the Cost-Effective Subgraph, for which we obtain W[1]-hardness results that might be of independent interest. We finally show that computing the exemplar TD distance between S and T is fixed-parameter tractable. Our results open the door to many other questions, and we conclude with several open problems.

Subject Classification

ACM Subject Classification
  • Theory of computation
Keywords
  • Tandem duplication
  • Text processing
  • Formal languages
  • Computational genomics
  • FPT algorithms

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Noga Alon, Jehoshua Bruck, Farzad F. Hassanzadeh, and Siddharth Jain. Duplication distance to the root for binary sequences. IEEE Transactions on Information Theory, 63(12):7793-7803, 2017. URL: https://doi.org/10.1109/TIT.2017.2744608.
  2. Gary Benson and Lan Dong. Reconstructing the duplication history of a tandem repeat. In Proceedings of the 17th International Conference on Intelligent Systems for Molecular Biology, ISMB'99, pages 44-53. AAAI, 1999. Google Scholar
  3. Daniel P. Bovet and Stefano Varricchio. On the regularity of languages on a binary alphabet generated by copying systems. Information Processing Letters, 44(3):119-123, 1992. URL: https://doi.org/10.1016/0020-0190(92)90050-6.
  4. Laurent Bulteau, Guillaume Fertin, and Irena Rusu. Sorting by transpositions is difficult. SIAM Journal on Discrete Mathematics, 26(3):1148-1180, 2012. URL: https://doi.org/10.1137/110851390.
  5. Brian Charlesworth, Paul Sniegowski, and Wolfgang Stephan. The evolutionary dynamics of repetitive dna in eukaryotes. Nature, 371:215-220, 1994. URL: https://doi.org/10.1038/371215a0.
  6. Kamalika Chaudhuri, Kevin Chen, Radu Mihaescu, and Satish Rao. On the tandem duplication-random loss model of genome rearrangement. In Proceedings of 17th ACM-SIAM Symposium on Discrete Algorithms, SODA'06, pages 564-570. SIAM, 2006. Google Scholar
  7. Zhi-Zhong Chen, Lusheng Wang, and Zhanyong Wang. Approximation algorithms for reconstructing the duplication history of tandem repeats. Algorithmica, 54(4):501-529, 2009. URL: https://doi.org/10.1007/s00453-008-9209-8.
  8. Da-Jung Cho, Yo-Sub Han, and Hwee Kim. Bound-decreasing duplication system. Theoretical Computer Science, 793:152-168, 2019. URL: https://doi.org/10.1016/j.tcs.2019.06.018.
  9. Juegen Dassow, Victor Mitrana, and Gheorghe Paun. On the regularity of the duplication closure. Bulletin of the EATCS, 69:133-136, 1999. Google Scholar
  10. Rodney G. Downey and Michael R. Fellows. Fixed-parameter tractability and completeness ii: On completeness for w[1]. Theoretical Computer Science, 141(1–2):109-131, 1995. URL: https://doi.org/10.1016/0304-3975(94)00097-3.
  11. Andrzej Ehrenfeucht and Grzegorz Rozenberg. On regularity of languages generated by copying systems. Discrete Applied Mathematics, 8(3):313-317, 1984. URL: https://doi.org/10.1016/0166-218X(84)90129-X.
  12. Andrew J. Sharp et al. and Even E. Eichler. Segmental duplications and copy-number variation in the human genome. The American J. of Human Genetics, 77(1):78-88, 2005. URL: https://doi.org/10.1086/431652.
  13. Marcy Macdonald et al. and Peter S. Harper. A novel gene containing a trinucleotide repeat that is expanded and unstable on huntington’s disease. Cell, 72(6):971-983, 1993. URL: https://doi.org/10.1016/0092-8674(93)90585-E.
  14. Olivier Gascuel, Michael D. Hendy, Alain Jean-Marie, and Robert McLachlan. The combinatorics of tandem duplication trees. Systematic Biology, 52(1):110-118, 2003. URL: https://doi.org/10.1080/10635150390132821.
  15. Dan Gusfield. Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge University Press, 1997. Google Scholar
  16. Dan Gusfield and Jens Stoye. Linear time algorithms for finding and representing all the tandem repeats in a string. Journal of Computer and System Sciences, 69(4):525-546, 2004. URL: https://doi.org/10.1016/j.jcss.2004.03.004.
  17. Sridhar Hannenhalli and Pavel A. Pevzner. Transforming men into mice (polynomial algorithm for genomic distance problem). In Proceedings of IEEE 36th Symposium on Foundations of Computer Science, FOCS'95, pages 581-592. IEEE, 1995. URL: https://doi.org/10.1109/SFCS.1995.492588.
  18. Farzad F. Hassanzadeh, Moshe Schwartz, and Jehoshua Bruck. The capacity of string-duplication systems. IEEE Transactions on Information Theory, 62(2):811-824, 2016. URL: https://doi.org/10.1109/TIT.2015.2505735.
  19. Masami Ito, Peter Leupold, and Kayoko Shikishima-Tsuji. Closure of languages under bounded duplications. In Proceedings of the 10th International Conference on Developments in Language Theory, volume 4036 of Lecture Notes in Computer Science, pages 238-247. Spring-Verlag, 2006. URL: https://doi.org/10.1007/11779148_22.
  20. Siddharth Jain, Farzad F. Hassanzadeh, and Jehoshua Bruck. Capacity and expressiveness of genomic tandem duplication. IEEE Transactions on Information Theory, 63(10):6129-6138, 2017. URL: https://doi.org/10.1109/TIT.2017.2728079.
  21. Richard M. Karp. Reducibility among combinatorial problems. In Raymond E. Miller, James W. Thatcher, and Jean D. Bohlinger, editors, Complexity of Computer Computations, The IBM Research Symposia Series, pages 85-103. Springer, Boston, MA, 1972. URL: https://doi.org/10.1007/978-1-4684-2001-2_9.
  22. Gad M. Landau, Jeanette P. Schmidt, and Dina Sokol. An algorithm for approximate tandem repeats. Journal of Computational Biology, 8(1):1-18, 2001. URL: https://doi.org/10.1089/106652701300099038.
  23. Ivica Letunic, Richard R. Copley, and Peer Bork. Common exon duplication in animals and its role in alternative splicing. Human Molecular Genetics, 11(13):1561-1567, 2002. URL: https://doi.org/10.1093/hmg/11.13.1561.
  24. Peter Leupold, Carlos Martin-Vide, and Victor Mitrana. Uniformly bounded duplication languages. Discrete Applied Mathematics, 146(3):301-310, 2005. URL: https://doi.org/10.1016/j.dam.2004.10.003.
  25. Peter Leupold, Victor Mitrana, and Jose M. Sempere. Formal languages arising from gene repeated duplication. In Gheorghe Paun Natasa Jonoska and Grzegorz Rozenberg, editors, Aspects of Molecular Computing, volume 2950 of Lecture Notes in Computer Science, pages 297-308. Springer-Verlag, Berlin, 2004. URL: https://doi.org/10.1007/978-3-540-24635-0_22.
  26. Layla Oesper, Anna M. Ritz, Sarah J. Aerni, Ryan Drebin, and Benjamin J. Raphael. Reconstructing cancer genomes from paired-end sequencing data. BMC Bioinformatics, 13(Suppl 6):S10, 2012. URL: https://doi.org/10.1186/1471-2105-13-S6-S10.
  27. Letu Qingge, Xiaozhou He, Zhihui Liu, and Binhai Zhu. On the minimum copy number generation problem in cancer genomics. In Proceedings of the 10th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB'18, pages 260-269, New York, NY, 2018. ACM Press. URL: https://doi.org/10.1145/3233547.3233586.
  28. David Sankoff. Gene and genome duplication. Current opinion in genetics and development, 11(6):681-684, 2001. URL: https://doi.org/10.1016/S0959-437X(00)00253-7.
  29. Jack W. Szostak and Ray Wu. Unequal crossing over in the ribosomal dna of saccharomyces cerevisiae. Nature, 284:426-430, 1980. URL: https://doi.org/10.1038/284426a0.
  30. Olivier Tremblay-Savard, Denis Bertrand, and Nadia El-Mabrouk. Evolution of orthologous tandemly arrayed gene clusters. BMC Bioinformatics, 12(Suppl 9):S2, 2011. URL: https://doi.org/10.1186/1471-2105-12-S9-S2.
  31. Ming-Wei Wang. On the irregularity of the duplication closure. Bulletin of the EATCS, 70:162-163, 2000. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail