Simultaneous Reconstruction of Duplication Episodes and Gene-Species Mappings

Authors Paweł Górecki, Natalia Rutecka, Agnieszka Mykowiecka, Jarosław Paszek

Author Details

Paweł Górecki
  • Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Poland
Natalia Rutecka
  • Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Poland
Agnieszka Mykowiecka
  • Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Poland
Jarosław Paszek
  • Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Poland

Cite As

Paweł Górecki, Natalia Rutecka, Agnieszka Mykowiecka, and Jarosław Paszek. Simultaneous Reconstruction of Duplication Episodes and Gene-Species Mappings. In 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 6:1-6:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)


We present a novel problem, called MetaEC, which aims to infer gene-species assignments in a collection of gene trees with missing labels by minimizing the size of duplication episode clustering (EC). This problem is particularly relevant in metagenomics, where incomplete data often poses a challenge in the accurate reconstruction of gene histories. To solve MetaEC, we propose a polynomial-time dynamic programming (DP) formulation that verifies the existence of a set of duplication episodes from a predefined set of episode candidates. We then demonstrate how to use this DP to design an algorithm that solves MetaEC. Although the algorithm is exponential in the worst case, we introduce a heuristic modification of the algorithm that provides a solution with the knowledge that it is exact. To evaluate our method, we perform two computational experiments on simulated and empirical data containing whole genome duplication events, showing that our algorithm is able to accurately infer the corresponding events.

Subject Classification

ACM Subject Classification
  • Mathematics of computing → Combinatorial optimization
  • Applied computing → Computational genomics

Keywords
  • Genomic Duplication
  • Gene-Species Mapping
  • Duplication Episode
  • Gene Tree
  • Species Tree


