Distance-based Species Tree Estimation: Information-Theoretic Trade-off between Number of Loci and Sequence Length under the Coalescent

Authors Elchanan Mossel, Sebastien Roch


Cite As

Elchanan Mossel and Sebastien Roch. Distance-based Species Tree Estimation: Information-Theoretic Trade-off between Number of Loci and Sequence Length under the Coalescent. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2015). Leibniz International Proceedings in Informatics (LIPIcs), Volume 40, pp. 931-942, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2015)


We consider the reconstruction of a phylogeny from multiple genes under the multispecies coalescent. We establish a connection with the sparse signal detection problem, where one seeks to distinguish between a distribution and a mixture of the distribution and a sparse signal. Using this connection, we derive an information-theoretic trade-off between the number of genes needed for an accurate reconstruction and the sequence length of the genes.
  • phylogenetic reconstruction
  • multispecies coalescent
  • sequence length requirement
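
The sparse signal detection problem mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the paper's method: it assumes a standard normal null distribution and a shifted normal signal, and it uses a higher-criticism-style statistic, a standard test from the sparse mixture detection literature. All function names and parameter values (`eps`, `mu`, `n`) are hypothetical choices made for illustration only.

```python
import math
import random


def upper_tail_pvalue(x):
    """One-sided p-value of x under the assumed null N(0, 1)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def higher_criticism(samples):
    """Higher-criticism-style statistic computed from sorted p-values."""
    n = len(samples)
    pvalues = sorted(upper_tail_pvalue(x) for x in samples)
    best = 0.0
    for i, p in enumerate(pvalues[: n // 2], start=1):
        # Skip p-values below 1/n, which make the statistic unstable.
        if p < 1.0 / n:
            continue
        score = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
        best = max(best, score)
    return best


def draw(n, eps, mu, rng):
    """Draw n samples from the mixture (1 - eps) * N(0, 1) + eps * N(mu, 1)."""
    return [rng.gauss(mu if rng.random() < eps else 0.0, 1.0) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 2000  # illustrative sample size (assumption)
hc_null = higher_criticism(draw(n, 0.0, 0.0, rng))     # pure null, no signal
hc_mixture = higher_criticism(draw(n, 0.1, 4.0, rng))  # 10% of samples shifted by 4
print(f"null: {hc_null:.2f}, mixture: {hc_mixture:.2f}")
```

With a signal this strong, the statistic on the mixture sample should be far larger than on the null sample. The detection problem becomes hard when the signal fraction `eps` and the shift `mu` are both small, which is the regime where trade-offs of the kind described in the abstract arise.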



