Optimal Completion of Incomplete Gene Trees in Polynomial Time Using OCTAL

Authors Sarah Christensen, Erin K. Molloy, Pranjal Vachaspati, Tandy Warnow



Cite As

Sarah Christensen, Erin K. Molloy, Pranjal Vachaspati, and Tandy Warnow. Optimal Completion of Incomplete Gene Trees in Polynomial Time Using OCTAL. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 27:1-27:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)


Here we introduce the Optimal Tree Completion Problem, a general optimization problem that involves completing an unrooted binary tree (i.e., adding missing leaves) so as to minimize its distance from a reference tree on a superset of the leaves. More formally, given a pair of unrooted binary trees (T,t) where T has leaf set S and t has leaf set R, a subset of S, we wish to add all the leaves from S \ R to t so as to produce a new tree t' on leaf set S that has the minimum distance to T. We show that when the distance is defined by the Robinson-Foulds (RF) distance, an optimal solution can be found in polynomial time. We also present OCTAL, an algorithm that solves this RF Optimal Tree Completion Problem exactly in quadratic time. We report on a simulation study where we complete estimated gene trees using a reference tree that is based on a species tree estimated from a multi-locus dataset. OCTAL produces completed gene trees that are closer to the true gene trees than an existing heuristic approach, but the accuracy of the completed gene trees computed by OCTAL depends on how topologically similar the estimated species tree is to the true gene tree. Hence, under conditions with relatively low gene tree heterogeneity, OCTAL can be used to provide highly accurate completions of estimated gene trees. We close with a discussion of future research.
Keywords

  • phylogenomics
  • missing data
  • coalescent-based species tree estimation
  • gene trees
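
To make the problem statement in the abstract concrete, here is a minimal Python sketch of the RF Optimal Tree Completion Problem on toy inputs. It is not OCTAL: it enumerates every way of adding the missing leaves and keeps a completion with the smallest RF distance, which takes exponential time, whereas the abstract states that OCTAL solves the problem exactly in quadratic time. The nested-tuple tree encoding and all function names are assumptions made for illustration only.

```python
def leaf_set(tree):
    """Leaves of a tree encoded as nested tuples of string labels."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Non-trivial bipartitions of the unrooted tree encoded by `tree`.

    Each bipartition is stored as the side that does NOT contain a fixed
    anchor leaf, so equal splits compare equal across trees.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        # Keep only splits with at least two leaves on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        return below

    visit(tree)
    return splits


def rf_distance(t1, t2):
    """Robinson-Foulds distance: bipartitions found in exactly one tree."""
    assert leaf_set(t1) == leaf_set(t2), "trees must share a leaf set"
    return len(bipartitions(t1) ^ bipartitions(t2))


def attachments(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to one edge of `tree`.

    Some unrooted trees are produced more than once (the encoding has an
    arbitrary root); duplicates are harmless for a brute-force search.
    """
    yield (tree, leaf)
    if not isinstance(tree, str):
        for i, child in enumerate(tree):
            for new_child in attachments(child, leaf):
                yield tree[:i] + (new_child,) + tree[i + 1:]


def brute_force_completion(reference, partial):
    """Add all leaves of S \\ R to `partial`, minimizing RF to `reference`."""
    missing = sorted(leaf_set(reference) - leaf_set(partial))
    candidates = [partial]
    for leaf in missing:
        candidates = [t for c in candidates for t in attachments(c, leaf)]
    return min(candidates, key=lambda t: rf_distance(reference, t))


# Reference tree T on S = {a, ..., f}; incomplete tree t on R = {a, b, d, e}.
T = ((("a", "b"), "c"), (("d", "e"), "f"))
t = (("a", "b"), ("d", "e"))
completed = brute_force_completion(T, t)
print(completed, rf_distance(T, completed))
```

In this toy case t agrees with T on the shared leaves, so a completion at RF distance 0 exists. When t conflicts with T, the minimum distance is strictly positive, which mirrors the abstract's observation that accuracy depends on how similar the reference tree is to the gene tree being completed.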



