Optimal Completion and Comparison of Incomplete Phylogenetic Trees Under Robinson-Foulds Distance

Authors: Keegan Yao, Mukul S. Bansal

Author Details

Keegan Yao
  • Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA
Mukul S. Bansal
  • Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA

Cite As

Keegan Yao and Mukul S. Bansal. Optimal Completion and Comparison of Incomplete Phylogenetic Trees Under Robinson-Foulds Distance. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 25:1-25:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Abstract


The comparison of phylogenetic trees is a fundamental task in phylogenetics and evolutionary biology. In many cases, these comparisons involve trees inferred on the same set of leaves, and many distance measures exist to facilitate such comparisons. However, several applications in phylogenetics require the comparison of trees that have non-identical leaf sets. The traditional approach for handling such comparisons is to first restrict the two trees being compared to just their common leaf set. An alternative, conceptually superior approach that has shown promise is to first complete the trees by adding missing leaves so that the completed trees have identical leaf sets. This alternative approach requires the computation of optimal completions of the two trees that minimize the distance between them. However, no polynomial-time algorithms currently exist for this optimal completion problem under any standard phylogenetic distance measure. In this work, we provide the first polynomial-time algorithms for the above problem under the widely used Robinson-Foulds (RF) distance measure. This hitherto unsolved problem is referred to as the RF(+) problem. We (i) show that a recently proposed linear-time algorithm for a restricted version of the RF(+) problem is a 2-approximation for the RF(+) problem, and (ii) provide an exact O(nk²)-time algorithm for the RF(+) problem, where n is the total number of distinct leaf labels in the two trees being compared and k, bounded above by n, depends on the topologies and leaf set overlap of the two trees. Our results hold for both rooted and unrooted binary trees. We implemented our exact algorithm and applied it to several biological datasets. Our results show that completion-based RF distance can lead to very different inferences regarding phylogenetic similarity compared to traditional RF distance. An open-source implementation of our algorithms is freely available from https://compbio.engr.uconn.edu/software/RF_plus.
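
To make the traditional baseline concrete, the following is a minimal sketch of the restriction-based comparison described above: both trees are pruned to their common leaf set, and the RF distance is then taken as the number of clusters found in exactly one of the two restricted trees. It does not implement the RF(+) completion algorithms of the paper. The nested-tuple tree encoding and the function names (`leaves`, `restrict`, `clusters`, `rf_distance`, `restricted_rf`) are assumptions made for illustration and are not taken from the authors' software; the sketch also assumes rooted trees and the unnormalized symmetric-difference form of the RF distance.

```python
# Illustrative sketch only: a rooted tree is a leaf label (str) or a tuple of subtrees.
# This encoding and all function names are hypothetical, not the authors' implementation.

def leaves(tree):
    """Return the set of leaf labels of a tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Prune the tree to the leaves in `keep`, suppressing nodes left with one child.

    Returns None if no leaf of the tree is in `keep`.
    """
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]
    return tuple(kids)


def clusters(tree):
    """Return the leaf sets below internal nodes, excluding the root cluster."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        found.add(below)
        return below

    found.discard(visit(tree))
    return found


def rf_distance(t1, t2):
    """Rooted RF distance: number of clusters present in exactly one tree."""
    if leaves(t1) != leaves(t2):
        raise ValueError("trees must have identical leaf sets")
    return len(clusters(t1) ^ clusters(t2))


def restricted_rf(t1, t2):
    """Traditional approach: restrict both trees to their common leaves, then compare."""
    common = leaves(t1) & leaves(t2)
    if not common:
        raise ValueError("trees share no leaves")
    return rf_distance(restrict(t1, common), restrict(t2, common))


# Two trees with non-identical leaf sets; only a, b, c, d are shared.
tree_a = ((("a", "b"), "c"), ("d", "e"))
tree_b = ((("a", "c"), "b"), ("d", "f"))
print(restricted_rf(tree_a, tree_b))  # prints 2
```

In this example the leaves e and f are discarded before comparison, so any topological information they carry is ignored; this is the information loss that the completion-based approach is intended to avoid.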

Subject Classification

ACM Subject Classification
  • Applied computing → Molecular evolution
  • Mathematics of computing → Combinatorial optimization
  • Theory of computation → Dynamic programming
  • Mathematics of computing → Trees

Keywords
  • Phylogenetic tree comparison
  • Robinson-Foulds Distance
  • Optimal tree completion
  • Algorithms
  • Dynamic programming



