New Absolute Fast Converging Phylogeny Estimation Methods with Improved Scalability and Accuracy

Authors Qiuyi (Richard) Zhang, Satish Rao, Tandy Warnow




File

LIPIcs.WABI.2018.8.pdf
  • Filesize: 479 kB
  • 12 pages


Author Details

Qiuyi (Richard) Zhang
  • Department of Mathematics, University of California at Berkeley, Berkeley CA 94720
Satish Rao
  • Division of Computer Science, University of California at Berkeley, Berkeley CA 94720
Tandy Warnow
  • Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, 61801, USA

Cite As

Qiuyi (Richard) Zhang, Satish Rao, and Tandy Warnow. New Absolute Fast Converging Phylogeny Estimation Methods with Improved Scalability and Accuracy. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 8:1-8:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)
https://doi.org/10.4230/LIPIcs.WABI.2018.8

Abstract

Absolute fast converging (AFC) phylogeny estimation methods are ones that have been proven to recover the true tree with high probability given sequences whose lengths are polynomial in the number of leaves in the tree (once the shortest and longest branch lengths are fixed). While there has been a large literature on AFC methods, the best in terms of empirical performance was DCM_NJ, published in SODA 2001. The main empirical advantage of DCM_NJ over other AFC methods is its use of neighbor joining (NJ) to construct trees on smaller taxon subsets, which are then combined into a tree on the full set of species using a supertree method; in contrast, the other AFC methods in essence depend on quartet trees that are computed independently of each other, which reduces accuracy compared to neighbor joining. However, DCM_NJ is unlikely to scale to large datasets due to its reliance on supertree methods, as no current supertree methods are able to scale to large datasets with high accuracy. In this study we present a new approach to large-scale phylogeny estimation that shares some of the features of DCM_NJ but bypasses the use of supertree methods. We prove that this new approach is AFC and runs in polynomial time. Furthermore, we describe variations on this basic approach that can be used with leaf-disjoint constraint trees (computed using methods such as maximum likelihood) to produce other AFC methods that are likely to provide even better accuracy. Thus, we present a new generalizable technique for large-scale tree estimation that is designed to improve the scalability of phylogeny estimation methods to ultra-large datasets, and that can be used in a variety of settings (including tree estimation from unaligned sequences, and species tree estimation from gene trees).
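
For readers unfamiliar with neighbor joining, the distance-based building block that the abstract credits for DCM_NJ's empirical advantage, the following is a minimal sketch of the textbook NJ procedure of Saitou and Nei. It is not the new method presented in the paper, and it makes no attempt at the divide-and-conquer or constraint-tree machinery the abstract describes. The function name `neighbor_joining`, the `frozenset`-keyed distance dictionary, and the `internal0`, `internal1`, ... node labels are assumptions chosen for illustration.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Textbook neighbor joining (illustrative sketch only).

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) -> pairwise distance.
    Returns a list of (node, node, branch_length) edges of an unrooted tree.
    """
    d = dict(dist)          # working copy; grows as internal nodes are added
    active = list(names)    # nodes that still need to be joined
    edges = []
    next_id = 0             # counter for hypothetical internal-node labels

    while len(active) > 2:
        n = len(active)
        # r[x]: total distance from x to every other active node
        r = {x: sum(d[frozenset((x, y))] for y in active if y != x)
             for x in active}
        # Join the pair minimising Q(a, b) = (n - 2) d(a, b) - r[a] - r[b]
        a, b = min(
            combinations(active, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        d_ab = d[frozenset((a, b))]
        u = f"internal{next_id}"
        next_id += 1
        # Branch lengths from a and b to their new parent u
        len_a = 0.5 * d_ab + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d_ab - len_a
        edges.append((a, u, len_a))
        edges.append((b, u, len_b))
        # Distances from the new node u to each remaining node
        for c in active:
            if c not in (a, b):
                d[frozenset((u, c))] = 0.5 * (
                    d[frozenset((a, c))] + d[frozenset((b, c))] - d_ab
                )
        active = [c for c in active if c not in (a, b)] + [u]

    # Connect the last two nodes directly
    x, y = active
    edges.append((x, y, d[frozenset((x, y))]))
    return edges


# Usage: distances taken from the tree ((A:1, B:2):5, (C:3, D:4))
example_dist = {
    frozenset(("A", "B")): 3, frozenset(("A", "C")): 9,
    frozenset(("A", "D")): 10, frozenset(("B", "C")): 10,
    frozenset(("B", "D")): 11, frozenset(("C", "D")): 7,
}
for edge in neighbor_joining(["A", "B", "C", "D"], example_dist):
    print(edge)
```

On exactly additive distances such as the example above, NJ returns the generating tree. With distances estimated from finite sequences the input is only approximately additive, which is where the sequence-length guarantees discussed in the abstract become relevant.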

Subject Classification

ACM Subject Classification
  • Applied computing → Molecular evolution
Keywords
  • phylogeny estimation
  • short quartets
  • sample complexity
  • absolute fast converging methods
  • neighbor joining
  • maximum likelihood


References

  1. Kevin Atteson. The performance of neighbor-joining methods of phylogenetic reconstruction. Algorithmica, 25(2-3):251-278, 1999.
  2. Kazuoki Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, Second Series, 19(3):357-367, 1967.
  3. Daniel G Brown and Jakub Truszkowski. Fast error-tolerant quartet phylogeny algorithms. In Annual Symposium on Combinatorial Pattern Matching, pages 147-161. Springer, 2011.
  4. Daniel G Brown and Jakub Truszkowski. Fast phylogenetic tree reconstruction using locality-sensitive hashing. In Workshop on Algorithms for Bioinformatics (WABI), pages 14-29. Springer, 2012.
  5. Peter Buneman. A note on the metric properties of trees. Journal of Combinatorial Theory (B), 17:48-50, 1974.
  6. James A Cavender. Taxonomy with confidence. Mathematical Biosciences, 40(3-4):271-280, 1978.
  7. Miklós Csűrös. Fast recovery of evolutionary trees with thousands of nodes. Journal of Computational Biology, 9(2):277-297, 2002.
  8. Peter L. Erdös, Michael A. Steel, Laszlo Székely, and Tandy Warnow. A few logs suffice to build (almost) all trees (i). Random Structures and Algorithms, 14:153-184, 1999.
  9. Peter L. Erdös, Michael A. Steel, Laszlo Székely, and Tandy Warnow. A few logs suffice to build (almost) all trees (ii). Theoretical Computer Science, 221:77-118, 1999.
  10. Daniel Huson, Scott Nettles, Laxmi Parida, Tandy Warnow, and Shibu Yooseph. The disk-covering method for tree reconstruction. In Algorithms and Experiments (ALEX), pages 62-75, 1998.
  11. Valerie King, Li Zhang, and Yunhong Zhou. On the complexity of distance-based evolutionary tree reconstruction. In 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 444-453. Society for Industrial and Applied Mathematics, 2003.
  12. Radu Mihaescu, Cameron Hill, and Satish Rao. Fast phylogeny reconstruction through learning of ancestral sequences. Algorithmica, 66(2):419-449, 2013.
  13. Elchanan Mossel. Phase transitions in phylogeny. Transactions of the American Mathematical Society, 356(6):2379-2404, 2004.
  14. Luay Nakhleh, Usman Roshan, Katherine St. John, Jerry Sun, and Tandy Warnow. Designing fast converging phylogenetic methods. Bioinformatics, 17(suppl_1):S190-S198, 2001.
  15. Jerzy Neyman. Molecular studies of evolution: a source of novel statistical problems. In Statistical decision theory and related topics, pages 1-27. Elsevier, 1971.
  16. Sébastien Roch. A short proof that phylogenetic tree reconstruction by maximum likelihood is hard. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 3(1):92-94, 2006.
  17. Sébastien Roch. Sequence-length requirement for distance-based phylogeny reconstruction: Breaking the polynomial barrier. In Proc. of the IEEE Symposium on Foundations of Computer Science (FOCS), pages 729-738, 2008.
  18. Sébastien Roch. Towards extracting all phylogenetic information from matrices of evolutionary distances. Science, 327(5971):1376-1379, 2010.
  19. Sébastien Roch and Allan Sly. Phase transition in the sample complexity of likelihood-based phylogeny inference. Probability Theory and Related Fields, 169(1):3-62, Oct 2017.
  20. Naruya Saitou and Masatoshi Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4):406-425, 1987.
  21. Michael A. Steel. Recovering a tree from the leaf colourations it generates under a Markov model. Applied Mathematics Letters, 7(2):19-23, 1994.
  22. Jakub Truszkowski, Yanqi Hao, and Daniel G Brown. Towards a practical O(n log n) phylogeny algorithm. Algorithms for Molecular Biology, 7(1):32, 2012.
  23. Tandy Warnow. Computational Phylogenetics: An introduction to designing methods for phylogeny estimation. Cambridge University Press, 2018.
  24. Tandy Warnow. Supertree construction: Opportunities and challenges. arXiv:1805.03530 [q-bio.PE], 2018.
  25. Tandy Warnow, Bernard M.E. Moret, and Katherine St. John. Absolute convergence: true trees from short sequences. In Proceedings of SODA, pages 186-195. ACM, 2001.