TRACTION: Fast Non-Parametric Improvement of Estimated Gene Trees

Authors: Sarah Christensen, Erin K. Molloy, Pranjal Vachaspati, Tandy Warnow


File

LIPIcs.WABI.2019.4.pdf
  • Filesize: 0.55 MB
  • 16 pages

Author Details

Sarah Christensen
  • University of Illinois at Urbana-Champaign, USA
Erin K. Molloy
  • University of Illinois at Urbana-Champaign, USA
Pranjal Vachaspati
  • University of Illinois at Urbana-Champaign, USA
Tandy Warnow
  • University of Illinois at Urbana-Champaign, USA

Acknowledgements

We thank Mike Steel for encouragement and the members of the Warnow lab for valuable feedback. This study was performed on the Illinois Campus Cluster and Blue Waters, a computing resource that is operated and financially supported by UIUC in conjunction with the National Center for Supercomputing Applications.

Cite As

Sarah Christensen, Erin K. Molloy, Pranjal Vachaspati, and Tandy Warnow. TRACTION: Fast Non-Parametric Improvement of Estimated Gene Trees. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 4:1-4:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/LIPIcs.WABI.2019.4

Abstract

Gene tree correction aims to improve the accuracy of a gene tree by using computational techniques along with a reference tree (and in some cases available sequence data). It is an active area of research when dealing with gene tree heterogeneity due to duplication and loss (GDL). Here, we study the problem of gene tree correction where gene tree heterogeneity is instead due to incomplete lineage sorting (ILS, a common problem in eukaryotic phylogenetics) and horizontal gene transfer (HGT, a common problem in bacterial phylogenetics). We introduce TRACTION, a simple polynomial-time method that provably finds an optimal solution to the RF-Optimal Tree Refinement and Completion Problem, which seeks a refinement and completion of an input tree t with respect to a given binary tree T so as to minimize the Robinson-Foulds (RF) distance. We present the results of an extensive simulation study evaluating TRACTION within gene tree correction pipelines on 68,000 estimated gene trees, using estimated species trees as reference trees. We explore accuracy under conditions with varying levels of gene tree heterogeneity due to ILS and HGT. We show that TRACTION matches or improves the accuracy of well-established methods from the GDL literature under conditions with HGT and ILS, and ties for best under the ILS-only conditions. Furthermore, TRACTION ties for fastest on these datasets. TRACTION is available at https://github.com/pranjalv123/TRACTION-RF and the study datasets are available at https://doi.org/10.13012/B2IDB-1747658_V1.
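
To make the optimization criterion in the abstract concrete, the following minimal Python sketch computes the unnormalized RF distance as the number of non-trivial bipartitions found in exactly one of two trees on the same leaf set. It is an illustration only, not the TRACTION algorithm: the nested-tuple tree encoding, the function names (`leaf_set`, `bipartitions`, `rf_distance`), and the five-leaf example trees are all assumptions made for this sketch.

```python
def leaf_set(tree):
    """Collect the leaf labels of a tree written as nested tuples."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial bipartitions (splits) induced by the edges of a tree."""
    everything = leaf_set(tree)
    splits = set()

    def visit(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        above = everything - below
        # Keep only splits with at least two leaves on each side.
        if len(below) > 1 and len(above) > 1:
            # An unordered pair of sides gives one canonical key per split.
            splits.add(frozenset([below, above]))
        return below

    visit(tree)
    return splits


def rf_distance(tree_a, tree_b):
    """Count the bipartitions present in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("both trees must have the same leaf set")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Hypothetical example trees (not taken from the study datasets).
reference = (("A", "B"), ("C", ("D", "E")))    # binary reference tree T
unresolved = ("A", "B", "C", ("D", "E"))       # input tree t with a polytomy
refined = (("A", "B"), "C", ("D", "E"))        # one refinement of t
conflicting = (("A", "C"), ("B", ("D", "E")))  # binary tree that disagrees with T

print(rf_distance(reference, unresolved))
print(rf_distance(reference, refined))
print(rf_distance(reference, conflicting))
```

In this toy case, resolving the polytomy in `unresolved` by adding the split AB|CDE yields `refined`, which shares every bipartition with `reference`. This is the sense in which a refinement can lower the RF distance to the reference tree. Completion, that is, adding leaves missing from t, is not covered by this sketch.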

Subject Classification

ACM Subject Classification
  • Applied computing → Molecular evolution
  • Applied computing → Population genetics
Keywords
  • Gene tree correction
  • horizontal gene transfer
  • incomplete lineage sorting
