Distance-based Species Tree Estimation: Information-Theoretic Trade-off between Number of Loci and Sequence Length under the Coalescent
We consider the reconstruction of a phylogeny from multiple genes under the multispecies coalescent. We establish a connection with the sparse signal detection problem, where one seeks to distinguish between a distribution and a mixture of the distribution and a sparse signal. Using this connection, we derive an information-theoretic trade-off between the number of genes needed for an accurate reconstruction and the sequence length of the genes.
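As a hedged illustration of the sparse signal detection problem mentioned in the abstract, the standard formulation can be written as a test between a null distribution and a sparse mixture. The symbols n, P, Q and \varepsilon below are notation assumed here for illustration and are not taken from the paper; how they map to the number of genes and the sequence length is left to the paper itself.

```latex
% Sketch of the generic sparse mixture detection problem (assumed notation).
% P: null distribution, Q: signal distribution, \varepsilon: small mixing weight.
\[
H_0:\; X_1,\dots,X_n \overset{\text{i.i.d.}}{\sim} P
\qquad\text{versus}\qquad
H_1:\; X_1,\dots,X_n \overset{\text{i.i.d.}}{\sim} (1-\varepsilon)\,P + \varepsilon\,Q,
\quad 0 < \varepsilon \ll 1 .
\]
```

Loosely, detection becomes harder when \varepsilon is small (few samples carry signal) or when Q is close to P (each signal is weak), which is the kind of tension that a trade-off between sample count and per-sample informativeness expresses.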
phylogenetic reconstruction
multispecies coalescent
sequence length requirement
931-942
Regular Paper
Elchanan Mossel
Sebastien Roch
10.4230/LIPIcs.APPROX-RANDOM.2015.931
Creative Commons Attribution 3.0 Unported license
https://creativecommons.org/licenses/by/3.0/legalcode