New Absolute Fast Converging Phylogeny Estimation Methods with Improved Scalability and Accuracy

Zhang, Qiuyi (Richard); Rao, Satish; Warnow, Tandy

doi:10.4230/LIPIcs.WABI.2018.8

File

LIPIcs.WABI.2018.8.pdf

Filesize: 479 kB
12 pages

Document Identifiers

DOI: 10.4230/LIPIcs.WABI.2018.8
URN: urn:nbn:de:0030-drops-93108

Author Details

Qiuyi (Richard) Zhang

Department of Mathematics, University of California at Berkeley, Berkeley CA 94720

Satish Rao

Division of Computer Science, University of California at Berkeley, Berkeley CA 94720

Tandy Warnow

Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, 61801, USA

Cite As

Qiuyi (Richard) Zhang, Satish Rao, and Tandy Warnow. New Absolute Fast Converging Phylogeny Estimation Methods with Improved Scalability and Accuracy. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 8:1-8:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)
https://doi.org/10.4230/LIPIcs.WABI.2018.8

Abstract

Absolute fast converging (AFC) phylogeny estimation methods are ones that have been proven to recover the true tree with high probability given sequences whose lengths are polynomial in the number of leaves in the tree (once the shortest and longest branch lengths are fixed). While there is a large literature on AFC methods, the best in terms of empirical performance was DCM_NJ, published in SODA 2001. The main empirical advantage of DCM_NJ over other AFC methods is its use of neighbor joining (NJ) to construct trees on smaller taxon subsets, which are then combined into a tree on the full set of species using a supertree method; in contrast, the other AFC methods in essence depend on quartet trees that are computed independently of each other, which reduces accuracy compared to neighbor joining. However, DCM_NJ is unlikely to scale to large datasets due to its reliance on supertree methods, as no current supertree methods are able to scale to large datasets with high accuracy. In this study, we present a new approach to large-scale phylogeny estimation that shares some of the features of DCM_NJ but bypasses the use of supertree methods. We prove that this new approach is AFC and runs in polynomial time. Furthermore, we describe variations on this basic approach that can be used with leaf-disjoint constraint trees (computed using methods such as maximum likelihood) to produce other AFC methods that are likely to provide even better accuracy. Thus, we present a new generalizable technique for large-scale tree estimation that is designed to improve the scalability of phylogeny estimation methods to ultra-large datasets, and that can be used in a variety of settings (including tree estimation from unaligned sequences, and species tree estimation from gene trees).
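To illustrate what "quartet trees that are computed independently of each other" can look like in practice, here is a minimal sketch of the standard four-point method, which picks, for each set of four taxa, the pairing with the smallest sum of within-pair distances. This is not the method from the paper. The function names, the toy five-taxon tree, and its distances are assumptions made up for illustration.

```python
from itertools import combinations

def four_point_split(dist, a, b, c, d):
    """Return the pairing of {a, b, c, d} with the smallest within-pair distance sum.

    `dist` maps frozenset({x, y}) to the estimated distance between taxa x and y.
    """
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    return min(pairings, key=lambda p: dist[frozenset(p[0])] + dist[frozenset(p[1])])

def all_quartet_splits(dist, taxa):
    """Estimate every quartet independently of the others (hypothetical helper)."""
    return {q: four_point_split(dist, *q) for q in combinations(taxa, 4)}

# Toy additive distances for the tree ((A,B),C,(D,E)) with all edge lengths 1.
# These values are illustrative only and are not taken from the paper.
raw = {
    ("A", "B"): 2, ("A", "C"): 3, ("A", "D"): 4, ("A", "E"): 4,
    ("B", "C"): 3, ("B", "D"): 4, ("B", "E"): 4,
    ("C", "D"): 3, ("C", "E"): 3, ("D", "E"): 2,
}
dist = {frozenset(pair): value for pair, value in raw.items()}
taxa = ["A", "B", "C", "D", "E"]

splits = all_quartet_splits(dist, taxa)
for quartet, split in splits.items():
    print(quartet, "->", split)
```

Because each quartet is resolved from only six pairwise distances, an error in one distance estimate can flip that quartet without any other quartet correcting it. This is one intuition for the abstract's remark that such methods are less accurate than neighbor joining applied to a whole taxon subset.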

Subject Classification

ACM Subject Classification

Applied computing → Molecular evolution

Keywords

phylogeny estimation
short quartets
sample complexity
absolute fast converging methods
neighbor joining
maximum likelihood

References

Kevin Atteson. The performance of neighbor-joining methods of phylogenetic reconstruction. Algorithmica, 25(2-3):251-278, 1999.
Kazuoki Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, Second Series, 19(3):357-367, 1967.
Daniel G Brown and Jakub Truszkowski. Fast error-tolerant quartet phylogeny algorithms. In Annual Symposium on Combinatorial Pattern Matching, pages 147-161. Springer, 2011.
Daniel G Brown and Jakub Truszkowski. Fast phylogenetic tree reconstruction using locality-sensitive hashing. In Workshop on Algorithms for Bioinformatics (WABI), pages 14-29. Springer, 2012.
Peter Buneman. A note on the metric properties of trees. Journal of Combinatorial Theory (B), 17:48-50, 1974.
James A Cavender. Taxonomy with confidence. Mathematical biosciences, 40(3-4):271-280, 1978.
Miklós Csűrös. Fast recovery of evolutionary trees with thousands of nodes. Journal of Computational Biology, 9(2):277-297, 2002.
Peter L. Erdös, Michael A. Steel, Laszlo Székely, and Tandy Warnow. A few logs suffice to build (almost) all trees (i). Random Structures and Algorithms, 14:153-184, 1999.
Peter L. Erdös, Michael A. Steel, Laszlo Székely, and Tandy Warnow. A few logs suffice to build (almost) all trees (ii). Theoretical Computer Science, 221:77-118, 1999.
Daniel Huson, Scott Nettles, Laxmi Parida, Tandy Warnow, and Shibu Yooseph. The disk-covering method for tree reconstruction. In Algorithms and Experiments (ALEX), pages 62-75, 1998.
Valerie King, Li Zhang, and Yunhong Zhou. On the complexity of distance-based evolutionary tree reconstruction. In 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 444-453. Society for Industrial and Applied Mathematics, 2003.
Radu Mihaescu, Cameron Hill, and Satish Rao. Fast phylogeny reconstruction through learning of ancestral sequences. Algorithmica, 66(2):419-449, 2013.
Elchanan Mossel. Phase transitions in phylogeny. Transactions of the American Mathematical Society, 356(6):2379-2404, 2004.
Luay Nakhleh, Usman Roshan, Katherine St. John, Jerry Sun, and Tandy Warnow. Designing fast converging phylogenetic methods. Bioinformatics, 17(suppl_1):S190-S198, 2001.
Jerzy Neyman. Molecular studies of evolution: a source of novel statistical problems. In Statistical decision theory and related topics, pages 1-27. Elsevier, 1971.
Sébastien Roch. A short proof that phylogenetic tree reconstruction by maximum likelihood is hard. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 3(1):92-94, 2006.
Sébastien Roch. Sequence-length requirement for distance-based phylogeny reconstruction: Breaking the polynomial barrier. In Proc. of the IEEE Symposium on Foundations of Computer Science (FOCS), pages 729-738, 2008.
Sébastien Roch. Towards extracting all phylogenetic information from matrices of evolutionary distances. Science, 327(5971):1376-1379, 2010.
Sébastien Roch and Allan Sly. Phase transition in the sample complexity of likelihood-based phylogeny inference. Probability Theory and Related Fields, 169(1):3-62, Oct 2017.
Naruya Saitou and Masatoshi Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular biology and evolution, 4(4):406-425, 1987.
Michael A. Steel. Recovering a tree from the leaf colourations it generates under a Markov model. Applied Mathematics Letters, 7(2):19-23, 1994.
Jakub Truszkowski, Yanqi Hao, and Daniel G Brown. Towards a practical O(n log n) phylogeny algorithm. Algorithms for Molecular Biology, 7(1):32, 2012.
Tandy Warnow. Computational Phylogenetics: An introduction to designing methods for phylogeny estimation. Cambridge University Press, 2018.
Tandy Warnow. Supertree construction: Opportunities and challenges. arXiv:1805.03530 [q-bio.PE], 2018.
Tandy Warnow, Bernard M.E. Moret, and Katherine St. John. Absolute convergence: true trees from short sequences. In Proceedings of SODA, pages 186-195. ACM, 2001.
