Fast and Accurate Species Trees from Weighted Internode Distances

Authors: Baqiao Liu, Tandy Warnow





Author Details

Baqiao Liu
  • Department of Computer Science, University of Illinois Urbana-Champaign, IL, USA
Tandy Warnow
  • Department of Computer Science, University of Illinois Urbana-Champaign, IL, USA

Acknowledgements

The authors thank the members of the Warnow lab for insightful comments.

Cite As

Baqiao Liu and Tandy Warnow. Fast and Accurate Species Trees from Weighted Internode Distances. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 8:1-8:24, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.WABI.2022.8

Abstract

Species tree estimation is a basic step in many biological research projects, but it is complicated by the fact that gene trees can differ from the species tree due to processes such as incomplete lineage sorting (ILS), gene duplication and loss (GDL), and horizontal gene transfer (HGT), which can cause different regions within the genome to have different evolutionary histories (i.e., "gene tree heterogeneity"). One approach to estimating species trees in the presence of gene tree heterogeneity resulting from ILS computes a tree on each genomic region (i.e., a "gene tree") and then uses these gene trees to define a matrix of average internode distances, where the internode distance in a tree T between two species x and y is the number of nodes in T between the leaves corresponding to x and y. Given such a matrix, a tree can then be computed using methods such as neighbor joining. Methods such as ASTRID and NJst, which use this basic approach, are provably statistically consistent, very fast (low-degree polynomial time), and have shown high accuracy under many conditions, which makes them competitive with other popular species tree estimation methods. In this study, inspired by the very recent work on weighted ASTRAL, we present weighted ASTRID, a variant of ASTRID that takes the branch uncertainty in the gene trees into account when computing internode distances. Our experimental study shows that weighted ASTRID improves on the accuracy of the original (unweighted) ASTRID while remaining fast. Moreover, weighted ASTRID is competitive in accuracy with weighted ASTRAL, the state of the art. Thus, this study provides a new and very fast method for species tree estimation that improves upon ASTRID and has accuracy comparable to the state of the art while being much faster. Weighted ASTRID is available at https://github.com/RuneBlaze/internode.
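
To make the internode distance matrix described in the abstract concrete, the following is a minimal Python sketch of the unweighted case only. It is not the authors' implementation and does not reproduce the weighting scheme of weighted ASTRID, which the abstract does not specify. The function names (build_adjacency, internode_distances, average_internode_matrix) and the tree encoding are assumptions made for illustration: each gene tree is an undirected adjacency map in which leaves are strings (species names) and internal nodes are integers.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) edge pairs (hypothetical helper)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def internode_distances(adj, leaves):
    """Map each leaf pair (x, y), x < y, to the number of nodes strictly between them."""
    dist = {}
    for src in leaves:
        # Breadth-first search gives the number of edges from src to every node.
        seen = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        for dst in leaves:
            if src < dst:
                # A path with k edges has k - 1 nodes between its endpoints.
                dist[(src, dst)] = seen[dst] - 1
    return dist


def average_internode_matrix(gene_trees):
    """Average each pair's internode distance over the gene trees containing both species."""
    totals, counts = {}, {}
    for adj in gene_trees:
        # Assumption of this sketch: leaves are the string-labelled nodes.
        leaves = sorted(n for n in adj if isinstance(n, str))
        for pair, d in internode_distances(adj, leaves).items():
            totals[pair] = totals.get(pair, 0) + d
            counts[pair] = counts.get(pair, 0) + 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


# Two toy gene trees on species A, B, C, D: ((A,B),(C,D)) and ((A,C),(B,D)).
tree1 = build_adjacency([("A", 0), ("B", 0), (0, 1), ("C", 1), ("D", 1)])
tree2 = build_adjacency([("A", 0), ("C", 0), (0, 1), ("B", 1), ("D", 1)])
matrix = average_internode_matrix([tree1, tree2])
```

The resulting dictionary can be converted to a square matrix and passed to any distance-based tree builder such as neighbor joining. A weighted variant would replace the per-edge count of 1 with a quantity derived from branch support, but the exact form of that weight is left open here because the abstract does not state it.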

Subject Classification

ACM Subject Classification
  • Applied computing → Molecular evolution
Keywords
  • Species tree estimation
  • ASTRID
  • ASTRAL
  • multi-species coalescent
  • incomplete lineage sorting
