DROPS

Document

DOI: 10.4230/LIPIcs.WABI.2023.5

Leveraging Constraints Plus Dynamic Programming for the Large Dollo Parsimony Problem

Authors: Junyan Dai, Tobias Rubel, Yunheng Han, and Erin K. Molloy

Published in: LIPIcs, Volume 273, 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)

Abstract

The last decade of phylogenetics has seen the development of many methods that leverage constraints plus dynamic programming. The goal of this algorithmic technique is to produce a phylogeny that is optimal with respect to some objective function and that lies within a constrained version of tree space. The popular species tree estimation method ASTRAL, for example, returns a tree that (1) maximizes the quartet score computed with respect to the input gene trees and that (2) draws its branches (bipartitions) from the input constraint set. This technique has yet to be used for classic parsimony problems where the input are binary characters, sometimes with missing values. Here, we introduce the clade-constrained character parsimony problem and present an algorithm that solves this problem in polynomial time for the Dollo criterion score. Dollo parsimony, which requires traits/mutations to be gained at most once but allows them to be lost any number of times, is widely used for tumor phylogenetics as well as species phylogenetics, for example analyses of low-homoplasy retroelement insertions across the vertebrate tree of life. Thus, we implement our algorithm in a software package, called Dollo-CDP, and evaluate its utility in the context of species phylogenetics using both simulated and real data sets. Our results show that Dollo-CDP can improve upon heuristic search from a single starting tree, often recovering a better scoring tree. Moreover, Dollo-CDP scales to data sets with much larger numbers of taxa than branch-and-bound while still having an optimality guarantee, albeit a more restricted one. Lastly, we show that our algorithm for Dollo parsimony can easily be adapted to Camin-Sokal parsimony but not Fitch parsimony.

Cite as

Junyan Dai, Tobias Rubel, Yunheng Han, and Erin K. Molloy. Leveraging Constraints Plus Dynamic Programming for the Large Dollo Parsimony Problem. In 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 5:1-5:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{dai_et_al:LIPIcs.WABI.2023.5,
  author =	{Dai, Junyan and Rubel, Tobias and Han, Yunheng and Molloy, Erin K.},
  title =	{{Leveraging Constraints Plus Dynamic Programming for the Large Dollo Parsimony Problem}},
  booktitle =	{23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)},
  pages =	{5:1--5:23},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-294-5},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{273},
  editor =	{Belazzougui, Djamal and Ouangraoua, A\"{i}da},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2023.5},
  URN =		{urn:nbn:de:0030-drops-186312},
  doi =		{10.4230/LIPIcs.WABI.2023.5},
  annote =	{Keywords: phylogenetics, parsimony, Dollo, Camin-Sokal, dynamic programming, constraints}
}

Document

Abstract

DOI: 10.4230/LIPIcs.WABI.2023.8

Quartets Enable Statistically Consistent Estimation of Cell Lineage Trees Under an Unbiased Error and Missingness Model (Abstract)

Authors: Yunheng Han and Erin K. Molloy

Published in: LIPIcs, Volume 273, 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)

Abstract

Cancer progression and treatment can be informed by reconstructing its evolutionary history from tumor cells [Lim et al., 2020]. Although many methods exist to estimate evolutionary trees (called phylogenies) from molecular sequences, traditional approaches assume the input data are error-free and the output tree is fully resolved. These assumptions are challenged in tumor phylogenetics because single-cell sequencing produces sparse, error-ridden data and because tumors evolve clonally [Jahn et al., 2016; Schwartz and Schäffer, 2017]. Here, we study the theoretical utility of methods based on quartets (four-leaf, unrooted phylogenetic trees) and triplets (three-leaf, rooted phylogenetic trees), in light of these barriers. Quartets and triplets have long been used as the building blocks for reconstructing the evolutionary history of species [Wilkinson et al., 2005]. The reason triplet-based methods (e.g., MP-EST [Liu et al., 2010]) and quartet-based methods (e.g., ASTRAL [Mirarab et al., 2014]) have garnered such success in species phylogenetics is their good statistical properties under the Multi-Species Coalescent (MSC) model [Pamilo and Nei, 1988; Rannala and Yang, 2003]; see Allman et al. (2011) and Degnan (2006) for identifiability results under the MSC model for quartets and triplets, respectively. Inspired by these efforts, we study the utility of quartets and triplets for estimating cell lineage trees under a popular tumor phylogenetics model [Jahn et al., 2016; Ross and Markowetz, 2016; Wu, 2019; Kizilkale et al., 2022] with two phases. First, mutations arise on a (highly unresolved) cell lineage tree according to the infinite sites model, and second, errors (false positives and false negatives) and missing values are introduced to the resulting mutation data in an unbiased fashion, mimicking data produced by single-cell sequencing protocols. This infinite sites plus unbiased error and missingness (IS+UEM) model generates mutations (rather than gene genealogies like the MSC model). However, a quartet (with leaves bijectively labeled by four cells) is implied by a mutation being present in two cells and absent from two cells [Molloy et al., 2021; Springer et al., 2019]; similarly, a triplet (on three cells) is implied by a mutation being present in two cells and absent from one cell. Our main result is that under the IS+UEM, the most probable quartet identifies the unrooted model cell lineage tree on four cells, with a mild assumption: the probability of false negatives and the probability of false positives must not sum to one. Somewhat surprisingly, our identifiability result for quartets does not extend to triplets, with more restrictive assumptions being required for identifiability. These results motivate seeking an unrooted cell lineage tree such that the number of quartets shared between it and the input mutations is maximized. We prove an optimal solution to this problem is a consistent estimator of the unrooted cell lineage tree under the IS+UEM model; this guarantee includes the case where the model tree is highly unresolved, provided that tree error is defined as the number of false negative branches. We therefore conclude by outlining how quartet-based methods might be employed for tumor phylogenetics given other important challenges like copy number aberrations and doublets.

Cite as

Yunheng Han and Erin K. Molloy. Quartets Enable Statistically Consistent Estimation of Cell Lineage Trees Under an Unbiased Error and Missingness Model (Abstract). In 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 8:1-8:2, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{han_et_al:LIPIcs.WABI.2023.8,
  author =	{Han, Yunheng and Molloy, Erin K.},
  title =	{{Quartets Enable Statistically Consistent Estimation of Cell Lineage Trees Under an Unbiased Error and Missingness Model}},
  booktitle =	{23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)},
  pages =	{8:1--8:2},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-294-5},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{273},
  editor =	{Belazzougui, Djamal and Ouangraoua, A\"{i}da},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2023.8},
  URN =		{urn:nbn:de:0030-drops-186347},
  doi =		{10.4230/LIPIcs.WABI.2023.8},
  annote =	{Keywords: Tumor Phylogenetics, Cell Lineage Trees, Quartets, Supertrees, ASTRAL}
}

Document

DOI: 10.4230/LIPIcs.WABI.2020.15

Advancing Divide-And-Conquer Phylogeny Estimation Using Robinson-Foulds Supertrees

Authors: Xilin Yu, Thien Le, Sarah Christensen, Erin K. Molloy, and Tandy Warnow

Published in: LIPIcs, Volume 172, 20th International Workshop on Algorithms in Bioinformatics (WABI 2020)

Abstract

One of the Grand Challenges in Science is the construction of the Tree of Life, an evolutionary tree containing several million species, spanning all life on earth. However, the construction of the Tree of Life is enormously computationally challenging, as all the current most accurate methods are either heuristics for NP-hard optimization problems or Bayesian MCMC methods that sample from tree space. One of the most promising approaches for improving scalability and accuracy for phylogeny estimation uses divide-and-conquer: a set of species is divided into overlapping subsets, trees are constructed on the subsets, and then merged together using a "supertree method". Here, we present Exact-RFS-2, the first polynomial-time algorithm to find an optimal supertree of two trees, using the Robinson-Foulds Supertree (RFS) criterion (a major approach in supertree estimation that is related to maximum likelihood supertrees), and we prove that finding the RFS of three input trees is NP-hard. We also present GreedyRFS (a greedy heuristic that operates by repeatedly using Exact-RFS-2 on pairs of trees, until all the trees are merged into a single supertree). We evaluate Exact-RFS-2 and GreedyRFS, and show that they have better accuracy than the current leading heuristic for RFS.

Cite as

Xilin Yu, Thien Le, Sarah Christensen, Erin K. Molloy, and Tandy Warnow. Advancing Divide-And-Conquer Phylogeny Estimation Using Robinson-Foulds Supertrees. In 20th International Workshop on Algorithms in Bioinformatics (WABI 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 172, pp. 15:1-15:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{yu_et_al:LIPIcs.WABI.2020.15,
  author =	{Yu, Xilin and Le, Thien and Christensen, Sarah and Molloy, Erin K. and Warnow, Tandy},
  title =	{{Advancing Divide-And-Conquer Phylogeny Estimation Using Robinson-Foulds Supertrees}},
  booktitle =	{20th International Workshop on Algorithms in Bioinformatics (WABI 2020)},
  pages =	{15:1--15:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-161-0},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{172},
  editor =	{Kingsford, Carl and Pisanti, Nadia},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2020.15},
  URN =		{urn:nbn:de:0030-drops-128048},
  doi =		{10.4230/LIPIcs.WABI.2020.15},
  annote =	{Keywords: supertrees, divide-and-conquer, phylogeny estimation}
}

Document

DOI: 10.4230/LIPIcs.WABI.2019.4

TRACTION: Fast Non-Parametric Improvement of Estimated Gene Trees

Authors: Sarah Christensen, Erin K. Molloy, Pranjal Vachaspati, and Tandy Warnow

Published in: LIPIcs, Volume 143, 19th International Workshop on Algorithms in Bioinformatics (WABI 2019)

Abstract

Gene tree correction aims to improve the accuracy of a gene tree by using computational techniques along with a reference tree (and in some cases available sequence data). It is an active area of research when dealing with gene tree heterogeneity due to duplication and loss (GDL). Here, we study the problem of gene tree correction where gene tree heterogeneity is instead due to incomplete lineage sorting (ILS, a common problem in eukaryotic phylogenetics) and horizontal gene transfer (HGT, a common problem in bacterial phylogenetics). We introduce TRACTION, a simple polynomial time method that provably finds an optimal solution to the RF-Optimal Tree Refinement and Completion Problem, which seeks a refinement and completion of an input tree t with respect to a given binary tree T so as to minimize the Robinson-Foulds (RF) distance. We present the results of an extensive simulation study evaluating TRACTION within gene tree correction pipelines on 68,000 estimated gene trees, using estimated species trees as reference trees. We explore accuracy under conditions with varying levels of gene tree heterogeneity due to ILS and HGT. We show that TRACTION matches or improves the accuracy of well-established methods from the GDL literature under conditions with HGT and ILS, and ties for best under the ILS-only conditions. Furthermore, TRACTION ties for fastest on these datasets. TRACTION is available at https://github.com/pranjalv123/TRACTION-RF and the study datasets are available at https://doi.org/10.13012/B2IDB-1747658_V1.

Cite as

Sarah Christensen, Erin K. Molloy, Pranjal Vachaspati, and Tandy Warnow. TRACTION: Fast Non-Parametric Improvement of Estimated Gene Trees. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 4:1-4:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{christensen_et_al:LIPIcs.WABI.2019.4,
  author =	{Christensen, Sarah and Molloy, Erin K. and Vachaspati, Pranjal and Warnow, Tandy},
  title =	{{TRACTION: Fast Non-Parametric Improvement of Estimated Gene Trees}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{4:1--4:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.4},
  URN =		{urn:nbn:de:0030-drops-110347},
  doi =		{10.4230/LIPIcs.WABI.2019.4},
  annote =	{Keywords: Gene tree correction, horizontal gene transfer, incomplete lineage sorting}
}

Document

DOI: 10.4230/LIPIcs.WABI.2017.27

Optimal Completion of Incomplete Gene Trees in Polynomial Time Using OCTAL

Authors: Sarah Christensen, Erin K. Molloy, Pranjal Vachaspati, and Tandy Warnow

Published in: LIPIcs, Volume 88, 17th International Workshop on Algorithms in Bioinformatics (WABI 2017)

Abstract

Here we introduce the Optimal Tree Completion Problem, a general optimization problem that involves completing an unrooted binary tree (i.e., adding missing leaves) so as to minimize its distance from a reference tree on a superset of the leaves. More formally, given a pair of unrooted binary trees (T,t) where T has leaf set S and t has leaf set R, a subset of S, we wish to add all the leaves from S \ R to t so as to produce a new tree t' on leaf set S that has the minimum distance to T. We show that when the distance is defined by the Robinson-Foulds (RF) distance, an optimal solution can be found in polynomial time. We also present OCTAL, an algorithm that solves this RF Optimal Tree Completion Problem exactly in quadratic time. We report on a simulation study where we complete estimated gene trees using a reference tree that is based on a species tree estimated from a multi-locus dataset. OCTAL produces completed gene trees that are closer to the true gene trees than an existing heuristic approach, but the accuracy of the completed gene trees computed by OCTAL depends on how topologically similar the estimated species tree is to the true gene tree. Hence, under conditions with relatively low gene tree heterogeneity, OCTAL can be used to provide highly accurate completions of estimated gene trees. We close with a discussion of future research.

Cite as

Sarah Christensen, Erin K. Molloy, Pranjal Vachaspati, and Tandy Warnow. Optimal Completion of Incomplete Gene Trees in Polynomial Time Using OCTAL. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 27:1-27:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{christensen_et_al:LIPIcs.WABI.2017.27,
  author =	{Christensen, Sarah and Molloy, Erin K. and Vachaspati, Pranjal and Warnow, Tandy},
  title =	{{Optimal Completion of Incomplete Gene Trees in Polynomial Time Using OCTAL}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{27:1--27:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.27},
  URN =		{urn:nbn:de:0030-drops-76392},
  doi =		{10.4230/LIPIcs.WABI.2017.27},
  annote =	{Keywords: phylogenomics, missing data, coalescent-based species tree estimation, gene trees}
}

Search Results

Documents authored by Molloy, Erin K.

Leveraging Constraints Plus Dynamic Programming for the Large Dollo Parsimony Problem

Abstract

Cite as

Quartets Enable Statistically Consistent Estimation of Cell Lineage Trees Under an Unbiased Error and Missingness Model (Abstract)

Abstract

Cite as

Advancing Divide-And-Conquer Phylogeny Estimation Using Robinson-Foulds Supertrees

Abstract

Cite as

TRACTION: Fast Non-Parametric Improvement of Estimated Gene Trees

Abstract

Cite as

Optimal Completion of Incomplete Gene Trees in Polynomial Time Using OCTAL

Abstract

Cite as