Dolphyin: A Combinatorial Algorithm for Identifying -Dollo Phylogenies in Cancer
Abstract
Several recent cancer phylogeny inference methods have used the -Dollo evolutionary model for single-nucleotide variants. Specifically, in this problem one is given an binary matrix and seeks a rooted tree with leaves that correspond to the rows of , and each node of is labeled by a binary state for each of the characters subject to the restriction that each character is gained at most once (-to- transition) and subsequently lost at most times (-to- transitions). The -Dollo variant, also known as the persistent perfect phylogeny where one is restricted to at most losses per character, has been studied extensively, but its hardness remains an open question. Here, we prove that the -Dollo Linear Phylogeny (1DLP) problem, where we additionally require the resulting -Dollo phylogeny to be linear, is equivalent to verifying whether the input matrix adheres to the Consecutive Ones Property (C1P), which can be solved in polynomial time. Due to the equivalence, several known NP-hardness results for relevant variants of C1P carry over to 1DLP, including the minimization of false negatives (-to- modifications to the input matrix ) or the allowance of gains and losses. We furthermore show how we can recursively decompose any, not necessarily linear, -Dollo phylogeny into several -Dollo linear phylogenies, connected by matching branching points. We extend this characterization to matrices that admit -Dollo phylogenies, giving necessary and sufficient conditions for the existence of a novel decomposition of into several submatrices and corresponding branching points. This decomposition forms the basis of Dolphyin, a new exponential-time algorithm for inferring -Dollo phylogenies that efficiently leverages the determination of linear -Dollo phylogenies as a subroutine. Dolphyin can also be applied to input matrices with false negatives. We demonstrate that Dolphyin is runtime-competitive with a previous integer linear programming based algorithm SPhyR on simulated datasets. We additionally analyze simulated datasets with false negative errors and find that in the median case, Dolphyin infers -Dollo phylogenies with inferred error rates at or below the ground truth rate. Finally, we apply Dolphyin to acute myeloid leukemia single-cell sequencing datasets, finding that the majority of the cancers can be explained by -Dollo phylogenies with false negative error rates in line with the used sequencing technology.
Availability.
Dolphyin is available at: https://github.com/elkebir-group/Dolphyin.
Keywords and phrases:
Intra-tumor heterogeneity, persistent perfect phylogeny, consecutive ones property, combinatoricsFunding:
Mohammed El-Kebir: National Science Foundation grant CCF-2046488 (M.E-K.) – This project was conducted as part of Spring 2024, CS 598 MEB: Computational Cancer Genomics.Copyright and License:
2012 ACM Subject Classification:
Applied computing Computational genomicsSupplementary Material:
archived at
swh:1:dir:61c0cd5f22da3a8e10cc6fdac70dd7d93ecb6be5
Editors:
Broňa Brejová and Rob PatroSeries and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
The clonal theory of cancer states that tumors are composed of heterogeneous clones, which are groups of cells with similar genotypes. Clones arise from an evolutionary process during which somatic mutations accumulate in cell populations [25]. By performing bulk or single-cell sequencing on these clones’ DNA or RNA, scientists can identify common somatic mutations such as single-nucleotide variants (SNVs), copy number aberrations (CNAs), or structural variants (SVs). Then, algorithms attempt to infer phylogenies, which represent the evolution of the tumor, from sequencing data for important downstream analysis and clinical decision-making [30]. These phylogenetic inference algorithms utilize sequencing-technology specific error characteristics and constraints on how these somatic mutations accumulate, which constitute an evolutionary model specific to the somatic mutations of interest [29].
In this work, we focus on the presence or absence of SNVs in single-cell DNA sequencing data. More precisely, we are given single-cell data in the form of a matrix where each row in the data is a taxon, representing a tumor cell, and each column is a single-nucleotide variant (SNV), hereafter referred to as a character. The entries in the data matrix would then be either or , indicating the absence or presence of a mutation in a particular cell. We wish to find a phylogeny, i.e. a rooted, node-labeled tree whose leaves represent the extant cells of the tumor, internal nodes represent ancestral tumor cells, and the root represents a normal cell [25], that explains this data. We would then need to assume an evolutionary model that constrains the phylogenies allowed on this data. Under the well-studied two-state perfect phylogeny model [1, 20, 15], for example, no character can be lost once gained in a path starting from the root of tree . Detecting whether an assumed error-free binary data matrix allows a phylogeny under the perfect phylogeny model is solvable in polynomial time [18, 1].
However, the restrictiveness of the perfect phylogeny model of evolution has inspired investigation into a wide range of more generalized and biologically-plausible models [4, 13]. Many analyses have operated under the flexible -Dollo model of evolution, under which any character may be gained exactly once but lost in the tumor’s evolution at most times [9, 28, 12, 7]. This flexibility affords the incorporation of common biological events, such as CNAs that may delete previously gained SNVs, and can thus be much more realistically versatile in biological analysis. The -Dollo phylogeny inference [11, 26] and tree size-constrained versions of Dollo phylogeny inference [11] are known to be NP-hard. Additionally, the problem variant for Dollo inference where the total number of losses summed over all characters is minimized, rather than outright bounded per character, can be solved in polynomial time when the resulting phylogeny is clade-constrained [9].
An important subcase of the -Dollo problem is the subcase or -Dollo problem, also known as the persistent phylogeny problem (Figure 1). The -Dollo problem has been extensively studied, using various problem statements, for over 20 years [17]. For example, characterizations of the -Dollo problem have yielded an exact algorithm that solves the -Dollo problem in time polynomial to the number of taxa and exponential to the number of characters [2]. Other work has also developed Integer Linear Programming (ILP) solutions to the -Dollo problem and shown a connection between galled trees and -Dollo phylogenies [19]. Graph-based approaches, specifically the ability to manipulate colored graphs representative of data matrices using sequenced and specific graph operations, have additionally yielded polynomial-time algorithms for a restricted version of the -Dollo problem [3]. However, the complexity of the general -Dollo problem remains an open question [3].
Separately, tumor phylogenies can either be linear – dash that is, phylogeny has no nontrivial branching points in which cell evolution diverges – dash or branching otherwise, based on the tumor’s selective pressures. In this work, we define “branching points” to entail internal nodes in with more than one child that are also internal nodes. Many phylogenies on real data, while branching, have exhibited a disproportionately small number of branching points relative to the number of taxa [28, 23]. Such phylogenies plausibly arise when the tumor microenvironment exerts severe enough selective pressure to limit branching to a few, highly viable offshoot clones [10]. This observation motivates phylogenetic inference that is specialized for finding linear or near-linear phylogenies. For example, machine learning techniques have been used to determine if a tumor phylogeny is likely linear [27] and, in previous work, we showed that determining the minimum number of changes from to in a data matrix such that the altered matrix is then representative of even a linear perfect phylogeny is NP-hard [31]. However, as far we are aware, determining -Dollo phylogenies on data that are strictly linear has not yet been explicitly examined.
In this paper, our aims are first theoretical and second experimental. First, we draw an equivalence between determining if a data matrix admits a -Dollo linear phylogeny and determining if has the consecutive ones property, which is a known property verifiable in polynomial time [16]. We use this theoretical characterization of matrices admitting -Dollo linear phylogenies to discuss natural problem variants, such as determining -Dollo linear phylogenies with fixed character-state vectors for the root or terminating leaf, and show that determining the minimal number of false negative entries in a sequencing data input to allow such a phylogeny, contrastingly, is NP-hard. As a tree can be recursively decomposed around its branching points, we then use this linear subcase of the -Dollo problem to recursively characterize all matrices admitting any -Dollo phylogeny, regardless of branching, with a series of necessary and sufficient conditions. Second, we develop a combinatorial algorithm, Dolphyin (DOllo Linear PHYlogeny INference Method), that uses this theoretical characterization to practically determine -Dollo phylogenies on sequencing data. We show that Dolphyin, which relies on determining linear chains of taxa satisfying the -Dollo model of evolution and then recursing on remaining taxa, is runtime-competitive with SPhyR – dash an ILP-based method of inference for the -Dollo problem. Additionally, we adapt Dolphyin to probabilistically correct for false-negative errors in sequencing. We apply Dolphyin to simulated datasets with false-negative errors and show that Dolphyin yields -Dollo phylogenies with inferred rates of error at or below the true rate of error. Finally, we apply Dolphyin to 99 real datasets of acute myeloid leukemia (AML) single cell sequencing data [23] and find that Dolphyin infers -Dollo phylogenies for 55 datasets with false negative error rates consistent with the used sequencing technology.
2 Problem Statement
Suppose we have sequenced cells of a tumor and identified single-nucleotide variants (SNVs). We are given a binary matrix , whose rows or taxa correspond to the sequenced cells and whose columns or characters correspond to the SNVs. The tumor cells share a common evolutionary history, represented by a rooted and node-labeled tree . Here, we require to adhere to the -Dollo evolutionary model [13], defined as follows.
Definition 1.
A rooted, node-labeled tree is a -Dollo phylogeny for an binary matrix rooted at provided
-
(i)
each node in is labeled by a binary vector ;
-
(ii)
the root of is labeled by ;
-
(iii)
has leaves such that each taxon corresponds to exactly one leaf in with parent such that ;
-
(iv)
for each character where , there is at most one gain edge such that and , and at most loss edges such that and ;
-
(v)
for each character where , there is no gain edge and at most loss edges.
We omit the subscript from node labeling and taxa mapping if it is clear from context that they apply to a particular tree . In this paper, we restrict our attention to the common case where at most loss per character is allowed and, unless otherwise stated, assume that the root must be labeled by all s, i.e. . We call such trees simply -Dollo phylogenies for . Thus, we seek to solve the following problem.
Problem 1 (-Dollo Phylogeny (1DP)).
Given binary matrix , build a -Dollo phylogeny for or show that one does not exist.
We note that the above problem is also known as the persistent phylogeny problem [2]. In addition to the above problem, we are also interested in the problem where is required to be a -Dollo linear phylogeny for .
Definition 2.
A -Dollo phylogeny for a binary matrix is linear if the removal of the leaves of corresponding to the taxa yields a chain graph.
Problem 2 (-Dollo Linear Phylogeny (1DLP)).
Given binary matrix , is there a -Dollo linear phylogeny for , and if so, build one.
3 Combinatorial Characterization
We characterize the solution spaces of both 1DP and 1DLP, starting with the more restrictive problem, 1DLP, in Section 3.1. We then build on this result by discussing the complexity of common problem variants to 1DLP. Finally, we demonstrate in Section 3.2 that solutions to the 1DP problem can be recursively characterized in terms of itself and -Dollo linear phylogenies. Due to space constraints, we delegate the proofs of all lemmas and theorems to the supplement.
To simplify the exposition, we introduce the following definition of a compact phylogeny and lemma, stating that one can assume without loss of generality that all internal, non-root nodes of solutions must either correspond to observed taxa or are branching points, and all internal edges must be either a gain or a loss edge for some character. Intuitively, any internal, non-root nodes of a given -Dollo phylogeny that do not correspond to one of these two cases can always be contracted to create a compact -Dollo phylogeny .
Definition 3.
A -Dollo phylogeny with root for a matrix is compact if
-
(i)
each internal node of either corresponds to an observed taxon, i.e. has a leaf child such that ; or is a branching point, i.e. has two distinct outgoing edges , such that and ; and
-
(ii)
and every internal edge is either a gain or loss edge for some character so that .
Lemma 4.
For each -Dollo phylogeny for a matrix there exists a unique compact -Dollo phylogeny for matrix obtained from .
3.1 1DLP and the Consecutive Ones Property
We show that the 1DLP problem is equivalent to determining whether the input binary matrix satisfies the consecutive ones property, which is defined as follows.
Definition 5 (Ref. [16]).
An binary matrix has the consecutive ones property (C1P) if there exists a permutation such that for each column the s appear consecutively when permuting the rows of according to .
To demonstrate this equivalence, we first propose the following construction of obtaining a tree from a matrix that is C1P with permutation , illustrated in Figure 2.
Definition 6.
The rooted, node-labeled tree resulting from a binary matrix that is C1P with permutation has (i) a root node labeled by , (ii) internal nodes and leaves labeled by for each taxon , (iii) edges and for each taxon and taxon leaf labeling for each taxon .
This construction leads us to the main theorem of this section.
Theorem 7.
There exists a -Dollo linear phylogeny for if and only if is C1P.
As determining whether any binary matrix is C1P including determining the corresponding permutation of rows is solvable in time using PQ trees [5], 1DLP is similarly solvable in time.
Corollary 8.
1DLP is solvable in time.
3.1.1 Rooted and Terminating Variants of 1DLP
A natural generalization of 1DLP is the Rooted -Dollo Linear Phylogeny (R1DLP) problem, where the root node must be labeled by a given vector not necessarily equal to .
Problem 3 (Rooted -Dollo Linear Phylogeny (R1DLP)).
Given binary matrix and binary vector , is there a -Dollo linear phylogeny for rooted at , and if so, build one.
Clearly, the 1DLP problem is a special case of the 1RDLP problem where . In the following, we show that the 1RDLP problem can also be solved in time by extending matrix as follows (Figure 3).
Definition 9.
The binary matrix resulting from binary matrix and -dimensional binary vector has entries equal to
| (1) |
Lemma 10.
There exists a -Dollo linear phylogeny for rooted at if and only if there exists a -Dollo linear phylogeny for .
Corollary 11.
R1DLP is solvable in time.
A second generalization of 1DLP is the Rooted, Terminated -Dollo Linear Phylogeny (RT1DLP) problem where upon removal of the leaves corresponding to the taxa of the root node is labeled by and the sink node is labeled by (Figure 3). More precisely, we have the following definition for such a constrained -Dollo linear phylogeny.
Definition 12.
A -Dollo linear phylogeny for rooted at terminates at if removing the leaves corresponding to the taxa yields a chain graph terminating at node such that .
Problem 4 (Rooted, Terminated -Dollo Linear Phylogeny (RT1DLP)).
Given binary matrix and binary vectors , is there a -Dollo linear phylogeny for rooted at and terminating at , and if so, build one.
Again, R1DLP (Problem 3) is a special case of RT1DLP where and 1DLP (Problem 2) is a special case of RT1DLP where . The 1RTDLP problem can be solved in time by a similar matrix extension to as discussed regarding 1RDLP.
Definition 13.
The binary matrix resulting from binary matrix and -dimensional binary vectors and has entries equal to
| (2) |
Lemma 14.
There exists a -Dollo linear phylogeny for rooted at and terminating at if and only if there exists a -Dollo linear phylogeny for .
Corollary 15.
RT1DLP is solvable in time.
3.1.2 Additional Variants of 1DLP
The direct equivalence from 1DLP to the Consecutive Ones Property allows several known properties from the latter to apply to 1DLP. For example, allowing for false negatives, a typical phenomenon in single-cell DNA sequencing due to allelic dropout [24], yields the following problem.
Problem 5 (Minimum Error -Dollo Linear Phylogeny (ME1DLP)).
Given binary matrix for cells and SNVs, determine the minimum number of to replacements in such that the resulting matrix has a -Dollo linear phylogeny .
As the equivalent problem of determining the minimum number of -to- matrix modifications of any binary matrix to satisfy C1P is NP-hard [6], ME1DLP is also NP-hard.
Corollary 16.
ME1DLP is NP-hard.
In a similar vein, the 1DLP problem variant with at most two gains and at most two losses per character is equivalent to the C1P generalization observed in [8], which is NP-hard.
Corollary 17.
Determining whether a binary matrix admits a -Dollo linear phylogeny with at most two gains and at most two losses per character is NP-hard.
3.2 Recursive Characterization of 1DP
We begin by stating the rooted version of 1DP (Problem 1), which has no linear constraint.
Problem 6 (Rooted -Dollo Phylogeny (R1DP)).
Given binary matrix and binary vector , construct a -Dollo phylogeny for rooted at or determine that one does not exist.
In this section we will show how a 1DP problem instance can be recursively decomposed into smaller Rooted -Dollo Linear Phylogeny (R1DLP, Problem 3), Rooted, Terminated -Dollo Linear Phylogeny (RT1DLP, Problem 4) and Rooted -Dollo Phylogeny (R1DP, Problem 6) instances on submatrices of .
To that end, we introduce the notation to indicate the submatrix of induced by rows/taxa and columns/characters . Moreover, we define to be the restriction of binary vector to only characters . For example, the restriction of to characters equals . We use the shorthand to indicate that all node labels of have been restricted to , i.e. for all nodes of .
3.2.1 Recursive Characterization of Rooted -Dollo Phylogenies
Like every tree structure, a -Dollo phylogeny for a given binary matrix can be recursively characterized. Key to the characterization are branching points, which are internal nodes with more than one non-leaf child. Let be the first branching point encountered by a tree traversal on starting from its root labeled by . Let be labeled by and have non-leaf children. If no such node exists then is simply a -Dollo linear phylogeny for , corresponding to the base case of the recurrence.
If there exists a branching point then, on the tree traversal from to , we encounter a subset of taxa as well as identify sets of characters that were gained or lost solely on this traversal, respectively. Note that a character first gained and then lost on this traversal is present in both and . Also, note that may not be a subset of ; for instance, there may be a character that was previously gained such that that is subsequently lost prior to the branching point , leading to and . Let , and let be labeled by binary vector . The encountered nodes on the traversal from to induce a subtree such that its restriction is precisely a -Dollo linear phylogeny for submatrix rooted at and terminating at . To characterize the remainder of the tree, observe that performing a traversal of starting at the -th outgoing edge from yields a tree composed of taxa , gained characters and lost characters . Let . Since decomposing a tree cannot add new unique edges or nodes to the sum of its parts, we have that is precisely a -Dollo phylogeny for submatrix rooted at .
Lemma 18.
For a given binary vector and -Dollo phylogeny for matrix , let be the subtree of obtained by traversing from the node labeled by to a first branching point with label , and let be the subtrees of obtained by traversing along each of the outgoing edges from . Let , and be the gained characters, lost characters and observed taxa, respectively, in tree where . Let for all . Then, the following conditions hold.
-
(i)
Sets are pairwise disjoint, sets are pairwise disjoint, and sets are pairwise disjoint.
-
(ii)
Sets are pairwise disjoint, and is the set of all taxa observed in the subtree of rooted at .
-
(iii)
for all .
-
(iv)
Tree is a -Dollo linear phylogeny for submatrix rooted at terminating at .
-
(v)
For each , tree is a -Dollo phylogeny for submatrix rooted at .
Thus, at each branching point of with non-leaf children, we obtain a single instance of Rooted, Terminated -Dollo Linear Phylogeny (RT1DLP, Problem 4) and instances of Rooted -Dollo Phylogeny (R1DP, Problem 6). Each of these instances can be further decomposed in a recursive fashion by identifying subsequent branching points.
3.2.2 Recursive decomposition of matrix
The above recursive decomposition of a given -Dollo phylogeny for matrix rooted at some yields trees on submatrices , respectively. As a phylogeny can be decomposed, therefore, matrix can also be decomposed. However, it is far less apparent how to do so without prior knowledge of the -Dollo phylogeny on . Here, we describe how given and , submatrices can be inferred solely from the taxa set and the two character sets . We begin by noting how and character sets uniquely determine the label of a potential branching point, since there is a unique path from the root to the branching point containing gains and losses . To that end, we define the following function.
Definition 19.
Given binary vector and characters , the -dimensional binary vector consists of entries
| (3) |
Lemma 20.
Let be a -Dollo phylogeny for matrix . For any node labeled by and descendant node labeled by it holds that where and are the characters that are gained and lost, respectively, on the path from to .
Knowledge of sets , , and immediately implies knowledge of since . Additional knowledge of allows us to infer the terminal label of RT1DLP instance .
Thus, our only nontrivial goal is to infer submatrices defined by taxa and characters . We note that by the definition of a -Dollo phylogeny, only characters can be potentially gained or lost after the branching point labeled by , and each such can only be gained and potentially lost in specifically one tree . To detail whether any character must be gained or lost in -Dollo phylogeny on some proposed matrix , we provide the following definition.
Definition 21.
A character is variable w.r.t. an matrix and -dimensional vector if there exists a taxon such that .
Our goal therefore translates into determining a partition of and partition of characters such that each character is variable w.r.t. at most one submatrix and (where ). To that end, we define the following matrix .
Definition 22.
The complement matrix obtained from binary matrix and -dimensional vector has entries
| (4) |
where and .
Lemma 23.
Let be a submatrix of defined by characters and taxa and let . Then, character is variable w.r.t. if and only if contains a in column .
Whether a character is variable with respect to vector and submatrix thus corresponds to whether the character column in contains a . Since any character must be variable in at most one submatrix, our goal now finally equates to inferring an block diagonal matrix decomposition into block matrices of , i.e.
| (5) |
Definition 24.
Partition of taxa and partition of characters are a block diagonal decomposition of if, for every character , there exists no for all such that .
This can be trivially achieved in time by the equivalent problem of, given a bipartite graph’s adjacency matrix, determining its connected components. This equivalency also demonstrates that the block diagonal matrix decomposition of maximum size for any matrix is unique (excluding characters containing all values of 0 in , which can be trivially assigned to any and have no restricting effect on determining a phylogeny). Thus, we assume we always infer block diagonal matrix decompositions of maximum size. We finally synthesize , matrix complement , and this block diagonal decomposition to formalize a -Dollo matrix decomposition on and by , , and .
Definition 25.
Given binary matrix and binary vector , the -Dollo matrix decomposition of and on , , and is defined as the set of submatrices , , …, such that , and are each given by the block diagonal decomposition and of , where and , with maximum size .
Therefore, given binary matrix and -Dollo phylogeny for rooted at , we have established a -Dollo matrix decomposition of and on known sets , , and that yields , , …, . Of course, such a decomposition assumes that the values of , , and are indeed correct; that is, that , , and are constructed according to the aforementioned recursive characterization of established in Section 3.2.1. Without knowledge of , such values are not trivially known. We use the following definition and lemma to establish a necessary condition to , , and being constructed according to this recursive characterization of some , essentially dictating that every character lacking a gain or loss edge in cannot be variable across the taxa in .
Definition 26.
Sets , , are in agreement with and if all characters are not variable w.r.t submatrix and vector .
We now arrive at the main theorem of this section, where we prove that this sole necessary condition, in tandem with the -Dollo matrix decomposition of binary matrix and binary vector by into , is both necessary and sufficient to recursively characterize all for which there exists a -Dollo phylogeny rooted at some . Intuitively, we prove the forward direction by decomposing an existing -Dollo phylogeny on as shown in Section 3.2.1, deriving directly. We then show that each subtree beneath the first-encountered branching point of corresponds to some matrix resulting from the -Dollo matrix decomposition of and on these derived values. Conversely, we prove the reverse direction by beginning from a -Dollo matrix decomposition on some existing and directly constructing from phylogenies on the decomposition’s individual parts. As an aide, we show an example -Dollo phylogeny’s recursive decomposition, paralleled by its data matrix’s recursive -Dollo decomposition (Figure 4).
To precisely allow for the composition of phylogenies that each may be on distinct sets of characters, we introduce the notation which, given a vector restricted to characters and vector on all characters , re-expands to include all characters in by supplementing missing characters with values from . For example, given on characters and on the full set of characters, yields . Then, the shorthand indicates the expansion of all node labels of phylogeny , i.e. for all nodes of . We use this notation to state the following theorem.
Theorem 27.
Given matrix and binary vector , there exists a -Dollo phylogeny for rooted at if and only if there exists some set of taxa and sets of characters subject to the following conditions:
-
1.
Sets , , are in agreement with and .
-
2.
For the -Dollo matrix decomposition of and on , , into submatrices , there exists a -Dollo linear phylogeny for rooted at and terminating on and -Dollo phylogenies for rooted at , respectively.
4 Methods
We introduce Dolphyin (DOllo Linear PHYlogeny INference), a combinatorial algorithm that uses the above, recursive, combinatorial characterization of rooted -Dollo phylogenies to solve R1DP altogether. Given any binary matrix and binary vector , Dolphyin determines a -Dollo phylogeny on rooted at by exhaustively searching over all possible values of and determining a set of values such that (i) are in agreement with and and (ii) given the -Dollo matrix decomposition of on and into submatrices , there exists a -Dollo linear phylogeny on rooted at terminating on and -Dollo phylogenies on each rooted at . This -Dollo linear phylogeny can be inferred in time, and -Dollo phylogenies are then inferred through recursive calls of Dolphyin on matrices . Dolphyin then returns the -Dollo phylogeny rooted at on as the tree formed by the construction on described in Theorem 27. This same Theorem 27 shows that this entire procedure is necessary and sufficient to solve any instance of R1DP. As an additional optimization, Dolphyin preprocesses all matrices by first removing all duplicate columns and rows, as well as trivial columns that contain 0, 1, or s.
4.1 Heuristically determining candidate values of
The exhaustive search over as described above is necessary and sufficient to determine any existing -Dollo phylogeny. However, such a brute-force search over these sets can be computationally intensive, requiring roughly time. Additionally, in practical R1DP instances, the vast majority of possible values of , , and in agreement with and do not even yield a -Dollo linear phylogeny on rooted at and terminating on .
Therefore, we enhance Dolphyin’s performance with a practical heuristic that, prior to the fully exhaustive search, initially restricts the enumerated values of in agreement with and to a subset of candidate values where such a has already been pre-determined to exist. Specifically, heuristic FindLinearChains constructs a set of precomputed trees with corresponding values such that for every element , is guaranteed to be a -Dollo phylogeny rooted at and terminating on on . Intuitively, FindLinearChains considers every taxon individually and, assuming that taxon is a branching point in , attempt to pack as many taxa as possible into a linear phylogeny beginning with and ending with . Formally, we construct a directed acyclic graph over all taxa with source such an edge between taxa exists if every character observed in is not lost along the edge. Then, for every such path in this graph beginning with , we consider all taxa in this path and check if such a -Dollo linear phylogeny indeed exists on these taxa across all characters. Critically, we record this set of taxa , along with , and , if and only if a -Dollo linear phylogeny exists.
Even with the above heuristic, the worst-case running time of the initial recursive call in Dolphyin remains , where the additional factor corresponds to checking whether the -Dollo decomposition of by yields a valid -Dollo linear phylogeny. Thus, the overall worst-case running time of Dolphyin is .
4.2 Adapting Dolphyin to false negatives in data
When examining simulated datasets with false negative errors or real data, we modified Dolphyin to probabilistically employ error correction. Specifically, in every recursive call, Dolphyin randomly considers pairs of taxa with replacement and, if the normalized Hamming distance over characters between both taxa is less than or equal to some value , alter any character seen in one taxon to be present in both taxa. For each pair of taxa, one taxon was selected with uniform probability and the other was selected with probability inversely proportional to each row’s prevalence in the dataset. To perform analysis on any given dataset with errors, we initially let , which equates to no error correction, and iteratively increased by until Dolphyin returned a solution within a 10 second time limit.
5 Results
Dolphyin, and its subsequent analysis and comparison to SPhyR, was implemented on an Apple 2.3 GHz 8-Core Intel Core i9 Macbook Pro in C++11. Dolphyin is available at: https://github.com/elkebir-group/Dolphyin with commit hash fbf400f used for the experiments in this paper.
5.1 Results on simulated data
We first used Dolphyin to analyze simulated matrices of errorless, single-cell sequencing data with -Dollo ground truth phylogenies. These datasets had either cells and SNVs, with datasets per combination of and , and were previously used to benchmark the ILP-based -Dollo solver SPhyR (Figure 5a). Data was generated using ms [21], and full details of data generation and the data itself accessible from SPhyR’s initial publication [12]. Dolphyin found and returned errorless -Dollo phylogenies for all 540 examined instances. We found that Dolphyin remained competitive with SPhyR in the majority of test cases across all input sizes, with identical median running times of seconds for both methods across all instances. While Dolphyin slightly outperformed SPhyR on the smaller input sizes (for instances: seconds for Dolphyin and seconds for SPhyR), SPhyR slightly outperformed Dolphyin on larger input sizes (for instances: seconds for Dolphyin and seconds for SPhyR).
To assay Dolphyin’s ability to infer -Dollo phylogenies on datasets containing false negatives, we then augmented the simulated datasets of sizes , by flipping each in these datasets to a with a false-negative error rate of . We chose to augment matrices of these sizes because the number of characters was most comparable to that of the real AML data we examined afterwards. Similarly, we chose an error rate of in simulations because of its similarity to the estimated median false negative rate of predicted under the Mission Bio Tapestri sequencing technology producing real AML data (allelic dropout rate of 5.8%) [23]. Since Dolphyin’s false negative error correction is inherently probabilistic, we analyzed each dataset with , , or restarts of Dolphyin. We report the lowest inferred false negative rate of all restarts (Figure 5b), which corresponds to the -Dollo phylogeny best fitting the data. In the median case, Dolphyin inferred phylogenies with a false negative rate at or below the ground truth rate used to generate the data (; 1 restart: median rate of , 5 restarts: median rate of , 10 restarts: median rate of ). Predictably, we found that increasing the number of restarts decreased the error rate of the best -Dollo phylogeny inferred. As an example, we provide a -Dollo phylogeny returned by Dolphyin in the analysis of a dataset with taxa and characters (Figure 5c). While Dolphyin is based on the characterization of -Dollo linear phylogenies, it clearly determines -Dollo phylogenies with branching.
5.2 Results on AML data
Having used Dolphyin to analyze simulated datasets both with and without false-negative sequencing errors, we then used Dolphyin to analyze 99 real sets of AML single-cell sequencing data [23] processed in a previous work [31] with 5 restarts per dataset. Prior to analysis, we removed all cells with unsequenced or unknown characters, yielding a mean of taxa, or cells, and characters per dataset.
We show the error rates achieved on each dataset across all 5 restarts (Figure 6a), demonstrating a moderate level of consistency in inferred error rates between restarts (standard deviation between restarts, averaged over datasets of 0.0634). Taking the minimum over all 5 restarts, Dolphyin inferred phylogenies on the majority (55) of datasets with an estimated false negative error rate at or below , approximately matching the false negative rate experimentally estimated in the datas’ initial publication by Mission Bio Tapestri sequencing (median allelic dropout rate of 5.8%) [23] (Figure 6b). We speculate that for those datasets on which error rates much greater than were inferred, there are three possibilities. Firstly, false positives in the real data, while less likely than false negatives (estimated false positive rate of 1% in initial work [23]), may be preventing Dolphyin from finding phylogenies with realistic error rates. Secondly and thirdly, more than losses may be necessary to realistically explain these datasets, or simply more restarts of Dolphyin may be required to locate a realistic solution.
Finally, as a precise example of the -Dollo phylogenies inferred by Dolphyin, we supply the mutation-annotated phylogeny Dolphyin inferred on AML dataset 51 with size , with a false negative error rate of (Figure 6c). Dolphyin inferred that AML dataset 51 had a near-linear -Dollo phylogeny with a internal node of outdegree 3 furthest from the root.
6 Conclusion
This work examines the problem of inferring a -Dollo, or persistent phylogeny on single-cell sequencing DNA data for SNVs. We first examine the subcase in which our -Dollo phylogeny must be linear, and we prove an equivalence between whether a binary data matrix admits a -Dollo linear phylogeny and whether has the known consecutive ones property, which can be verified in polynomial time [16]. We also develop polynomial-time algorithms for natural extensions of the -Dollo linear subcase, such as the cases when we restrict our -Dollo phylogeny to be rooted at and/or terminate at some states. Using the linear subcase, we recursively characterize all binary matrices that admit a -Dollo phylogeny with a series of conditions that are provably necessary and sufficient. Unfortunately, determining whether a matrix is -Dollo using this characterization takes exponential time, leaving the hardness of the -Dollo phylogeny problem open. In addition, we show that the problem of minimizing false negatives in data as to admit a -Dollo linear phylogeny is NP-hard.
We use these theoretical results to develop Dolphyin, a combinatorial algorithm that infers -Dollo phylogenies from sequencing data. Dolphyin directly leverages our above characterization of matrices admitting -Dollo phylogenies by identifying -Dollo linear phylogenies on subsets of taxa and recursing on remaining taxa and characters. Dolphyin also incorporates probabilistic error correction and thus can also be applied to data with false negative sequencing errors. We use Dolphyin to first analyze errorless, simulated datasets and show that Dolphyin is runtime competitive with SPhyR [12], a previous ILP-based approach for inferring -Dollo phylogenies. We then apply Dolphyin to simulated datasets with false negative errors and demonstrate that Dolphyin, in the median case, infers -Dollo phylogenies with an inferred error rate at or below the ground truth rate. We finally apply Dolphyin to 99 real acute myeloid leukemia datasets [23] and find that Dolphyin infers -Dollo phylogenies on the majority of these datasets with an error rate at or below the previously, experimentally-estimated false negative error rate specific to the sequencing technology producing these datasets.
In future work, we may consider more advanced error correction schemes for more widely applying Dolphyin to existing datasets. We may also attempt to extend a similar, combinatorial and recursive framework to the -Dollo model of evolution for , or models of evolution with more than one gain. However, we note that even determining a specifically linear and errorless phylogeny with two gains or losses, per character, is NP-hard [8]. Additionally, while we may consider problem extensions such as determining a maximal number of taxa or characters admitting -Dollo phylogenies in data, we note that the equivalence of 1DLP to the consecutive ones property makes several natural formulations NP-hard in even the linear and errorless cases [22]. Finally, while we argue that the first recursive call of Dolphyin is in the worst case, we would like to derive a more precise running time taking into account all recursive calls.
In summary, our work adds to the theoretical body of knowledge on the -Dollo, or persistent phylogeny, model of evolution and provides a practical algorithm for inferring phylogenies that leverages these theoretical results. We hope that this combinatorial approach will aid advances in determining the complexity of -Dollo problems and their variants.
References
- [1] Richa Agarwala and David Fernandez-Baca. A polynomial-time algorithm for the perfect phylogeny problem when the number of character states is fixed. SIAM J. Comput., 23(6):1216–1224, December 1994. doi:10.1137/S0097539793244587.
- [2] Paola Bonizzoni, Chiara Braghin, Riccardo Dondi, and Gabriella Trucco. The binary perfect phylogeny with persistent characters. Theoretical Computer Science, 454:51–63, 2012. Formal and Natural Computing. doi:10.1016/j.tcs.2012.05.035.
- [3] Paola Bonizzoni, Anna Paola Carrieri, Gianluca Della Vedova, Raffaella Rizzi, and Gabriella Trucco. A colored graph approach to perfect phylogeny with persistent characters. Theoretical Computer Science, 658:60–73, 2017. Formal Languages and Automata: Models, Methods and Application In honour of the 70th birthday of Antonio Restivo. doi:10.1016/j.tcs.2016.08.015.
- [4] Paola Bonizzoni, Anna Paola Carrieri, Gianluca Della Vedova, and Gabriella Trucco. Explaining evolution via constrained persistent perfect phylogeny. BMC Genomics, 15, October 2014. doi:10.1186/1471-2164-15-S6-S10.
- [5] Kellogg S. Booth and George S. Lueker. Testing for the consecutive ones property, interval graphs, and graph planarity using pq-tree algorithms. Journal of Computer and System Sciences, 13(3):335–379, 1976. doi:10.1016/S0022-0000(76)80045-1.
- [6] Kellogg Speed Booth. PQ-tree algorithms. PhD thesis, University of California, Berkeley, 1975. AAI7615117.
- [7] Remco Bouckaert, Mareike Fischer, and Kristina Wicke. Combinatorial perspectives on dollo-k characters in phylogenetics. Advances in Applied Mathematics, 131:102252, 2021. doi:10.1016/j.aam.2021.102252.
- [8] Cedric Chauve, Jan Manuch, and Murray Patterson. Hardness results for the gapped consecutive-ones property. Discrete Applied Mathematics, 160, December 2009. doi:10.1016/j.dam.2012.03.019.
- [9] Junyan Dai, Tobias Rubel, Yunheng Han, and Erin Molloy. Dollo-cdp: a polynomial-time algorithm for the clade-constrained large dollo parsimony problem. Algorithms for Molecular Biology, 19, January 2024. doi:10.1186/s13015-023-00249-9.
- [10] Alexander Davis, Ruli Gao, and Nicholas Navin. Tumor evolution: Linear, branching, neutral or punctuated? Biochimica et Biophysica Acta (BBA) - Reviews on Cancer, 1867(2):151–161, 2017. Evolutionary principles - heterogeneity in cancer? doi:10.1016/j.bbcan.2017.01.003.
- [11] William H.E. Day, David S. Johnson, and David Sankoff. The computational complexity of inferring rooted phylogenies by parsimony. Mathematical Biosciences, 81(1):33–42, 1986. doi:10.1016/0025-5564(86)90161-6.
- [12] Mohammed El-Kebir. SPhyR: tumor phylogeny estimation from single-cell sequencing data under loss and error. Bioinformatics, 34(17):i671–i679, September 2018. doi:10.1093/bioinformatics/bty589.
- [13] Mohammed El-Kebir, Gryte Satas, Layla Oesper, and Benjamin J. Raphael. Inferring the Mutational History of a Tumor Using Multi-state Perfect Phylogeny Mixtures. Cell Systems, 3(1):43–53, July 2016. doi:10.1016/j.cels.2016.07.004.
- [14] Daniel W. Feng and Mohammed El-Kebir. Dolphyin. Software, swhId: swh:1:dir:61c0cd5f22da3a8e10cc6fdac70dd7d93ecb6be5 (visited on 2025-08-04). URL: https://github.com/elkebir-group/Dolphyin, doi:10.4230/artifacts.24317.
- [15] David Fernández-Baca. The Perfect Phylogeny Problem, pages 203–234. Springer US, Boston, MA, 2001. doi:10.1007/978-1-4613-0255-1_6.
- [16] Delbert Fulkerson and Oliver Gross. Incidence matrices and interval graphs. Pacific Journal of Mathematics, 15(3):835–855, September 1965. doi:10.2140/pjm.1965.15.835.
- [17] Leslie Ann Goldberg, Paul W. Goldberg, Cynthia A. Phillips, Elizabeth Sweedyk, and Tandy Warnow. Minimizing phylogenetic number to find good evolutionary trees. Discrete Applied Mathematics, 71(1):111–136, 1996. doi:10.1016/S0166-218X(96)00060-1.
- [18] Dan Gusfield. Efficient algorithms for inferring evolutionary trees. Networks, 21(1):19–28, 1991. doi:10.1002/net.3230210104.
- [19] Dan Gusfield. Persistent phylogeny: a galled-tree and integer linear programming approach. In Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics, BCB ’15, pages 443–451, New York, NY, USA, 2015. Association for Computing Machinery. doi:10.1145/2808719.2808765.
- [20] Michel Habib and Juraj Stacho. Unique perfect phylogeny is NP-hard. In Proceedings of the 22nd Annual Conference on Combinatorial Pattern Matching, CPM’11, pages 132–146, Berlin, Heidelberg, 2011. Springer-Verlag. doi:10.1007/978-3-642-21458-5_13.
- [21] Richard R. Hudson. Generating samples under a wright–fisher neutral model of genetic variation. Bioinformatics, 18(2):337–338, February 2002. doi:10.1093/bioinformatics/18.2.337.
- [22] Witold LipskiJr. Generalizations of the consecutive ones property and related np-complete problems1. Fundamenta Informaticae, 6(1):53–69, 1983. doi:10.3233/FI-1983-6104.
- [23] Kiyomi Morita, Feng Wang, Katharina Jahn, Tianyuan Hu, Tomoyuki Tanaka, Yuya Sasaki, Jack Kuipers, Sanam Loghavi, Sa A. Wang, Yuanqing Yan, Ken Furudate, Jairo Matthews, Latasha Little, Curtis Gumbs, Jianhua Zhang, Xingzhi Song, Erika Thompson, Keyur P. Patel, Carlos E. Bueso-Ramos, Courtney D. DiNardo, Farhad Ravandi, Elias Jabbour, Michael Andreeff, Jorge Cortes, Kapil Bhalla, Guillermo Garcia-Manero, Hagop Kantarjian, Marina Konopleva, Daisuke Nakada, Nicholas Navin, Niko Beerenwinkel, P. Andrew Futreal, and Koichi Takahashi. Clonal evolution of acute myeloid leukemia revealed by high-throughput single-cell genomics. Nature communications, 11(1):5327–17, 2020.
- [24] Nicholas Navin, Jude Kendall, Jennifer Troge, Peter Andrews, Linda Rodgers, Jeanne McIndoo, Kerry Cook, Asya Stepansky, Dan Levy, Diane Esposito, Lakshmi Muthuswamy, Alex Krasnitz, W Richard McCombie, James Hicks, and Michael Wigler. Tumour evolution inferred by single-cell sequencing. Nature, 472(7341):90–94, April 2011.
- [25] Peter C. Nowell. The clonal evolution of tumor cell populations. Science, 194(4260):23–28, 1976. doi:10.1126/science.959840.
- [26] Teresa M. Przytycka, George B. Davis, Nan Song, and Dannie Durand. Graph theoretical insights into evolution of multidomain proteins. Journal of computational biology : a journal of computational molecular cell biology, 13 2:351–63, 2005. URL: https://api.semanticscholar.org/CorpusID:6639563.
- [27] Erfan Sadeqi Azer, Mohammad Haghir Ebrahimabadi, Salem Malikić, Roni Khardon, and S. Cenk Sahinalp. Tumor phylogeny topology inference via deep learning. iScience, 23(11):101655, 2020. doi:10.1016/j.isci.2020.101655.
- [28] Palash Sashittal, Haochen Zhang, Christine A. Iacobuzio-Donahue, and BenjaminJ. Raphael. Condor: tumor phylogeny inference with a copy-number constrained mutation loss model. Genome Biology, 24(1):272–23, 2023.
- [29] Russell Schwartz and Alejandro A Schäffer. The evolution of tumour phylogenetics: principles and practice. Nature reviews. Genetics, 18(4):213–229, April 2017.
- [30] Doris Tabassum and Kornelia Polyak. Tumorigenesis: It takes a village. Nature reviews. Cancer, 15, July 2015. doi:10.1038/nrc3971.
- [31] Leah L. Weber and Mohammed El-Kebir. Phyolin: Identifying a Linear Perfect Phylogeny in Single-Cell DNA Sequencing Data of Tumors. In Carl Kingsford and Nadia Pisanti, editors, 20th International Workshop on Algorithms in Bioinformatics (WABI 2020), volume 172 of Leibniz International Proceedings in Informatics (LIPIcs), pages 5:1–5:14, Dagstuhl, Germany, 2020. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.WABI.2020.5.
Appendix A Proofs
Proof (Lemma 4).
Let be a -Dollo phylogeny for matrix . Let be the set of internal, non-root nodes of that do not correspond to observed taxa nor are branching points (Definition 3). If is empty, then subsequently contracting all internal edges such that yields the unique compact -Dollo phylogeny for matrix obtained from . If is non-empty, consider any node and let be the parent of (which exists as ). Since is an internal node, we have that does not correspond to a taxon and that is not a branching point. Therefore, has exactly one child . We obtain by removing the edges and , removing the node , and inserting the new edge . Clearly, remains a -Dollo phylogeny for . Removing all nodes in (in any order) and subsequently contracting all internal edges such that yields the unique compact -Dollo phylogeny for matrix obtained from .
Proof (Theorem 7).
Let be a -Dollo linear phylogeny for matrix . By Definition 1, we have that has exactly leaves with incoming edges such that . Moreover, removal of these leaves results in a linear, chain graph , containing nodes and root . Performing a pre-order traversal of starting from yields an ordering of the nodes and therefore the taxa of . Note that by Definition 1 each character is gained at most once, and if gained, lost at most once. Therefore, for each character , all taxa such that will appear consecutively in . Hence, is C1P as certified by permutation .
Let be C1P. As such, there exists a permutation such that for each column there exists at most one row such that and and at most one row such that and . Let be the tree obtained from following Definition 6. We claim that is a -Dollo phylogeny for in line with Definition 1. Clearly, each node of is labeled by a binary vector (condition (i)), the root is labeled by (condition (ii)) and each taxon corresponds to a unique leaf with parent such that (condition (iii)). It remains to show that there is at most one gain edge and at most one loss edge (condition (iv)) for each character . However, since the permutation details at most one interval of consecutive taxa for each character , it follows that taxa in gain through at most one gain edge and lose through at most one subsequent loss edge. Condition (v) is similar satisfied with the further specification that there is no gain edge for any character where . Hence, is a -Dollo phylogeny for .
Proof (Lemma 10).
Let be a -Dollo linear phylogeny for matrix rooted at . Moreover let be the matrix obtained from and following Definition 9. Given , we will construct a -Dollo linear phylogeny for . Specifically, we construct from by re-rooting on an appended node labeled by the -dimensional vector and adding the edge . We additionally add leaf labeled by and include the edge . Then, we extend the -dimensional binary vector for each node of with an additional entry defined as
| (6) |
We claim that is a -Dollo linear phylogeny for . Clearly, is node-labeled and rooted at . By virtue of the fact that is a -Dollo linear phylogeny for matrix rooted at , every character has at most one gain edge and one loss edge in , and every original taxon is present in . Additionally, character has at most one gain edge outgoing from and no loss edges, and the new taxon correspond to the leaf . Finally, the construction of from retains the linearity of .
Let be a -Dollo linear phylogeny for such that is compact. Following Theorem 7, let be the permutation such that .
Since is linear, the root node of must have at most one non-leaf child. If no such node exists then trivially has taxa and the tree consisting of a single node labeled by is a -Dollo linear phylogeny for rooted at .
We now focus on the case where has taxa. Let be the non-leaf child of the root node of . Since for all and , it must be the case that either or . To see why observe that would imply that character would has two gain edges within , violating the definition of a -Dollo linear phylogeny. We may assume without loss of generality on that , since if , then for the permutation such that reverses , tree would be a -Dollo linear phylogeny for such that . Therefore, since , it must be the case that is labeled by .
Given , we will now construct a -Dollo linear phylogeny for matrix rooted at . Let be the leaf of whose parent is . Specifically, we define as the subtree of rooted at that excludes the leaf . We relabel each node of omitting the th entry of its original label .
Since contains all gain and loss edges present in precisely excluding edges and , contains all taxa in precisely excluding leaf taxon , and has root label , we have that is a -Dollo linear phylogeny for rooted at .
Proof (Lemma 14).
Let be a -Dollo linear phylogeny for matrix rooted at and terminating at . Moreover let be the matrix obtained from , and following Definition 13. Given , we will construct a -Dollo linear phylogeny for . Let be the root of . Since is linear, removal of the leaves corresponding to the taxa yields a directed chain graph. Let be the sink node of this graph. Specifically, we construct from by re-rooting on an appended node labeled by the -dimensional vector and adding the edge . We additionally add leaf labeled by and include the edge . Next, we add a leaf labeled by and include the edge . Then, we extend the -dimensional binary vector for each node of with an two additional entries and defined as
| (7) |
We claim that is a -Dollo linear phylogeny for . Clearly, is node-labeled and rooted at . By virtue of the fact that is a -Dollo linear phylogeny for matrix rooted at , every character has at most one gain edge and one loss edge in , and every original taxon is present in . Character has at most one gain edge outgoing from and no loss edges, and the new taxon corresponds to the leaf . Character has at most one gain edge and at most one loss edge incoming to , and the new taxon corresponds to the leaf . Finally, the construction of from retains the linearity of .
Let be a -Dollo linear phylogeny for such that is compact. Following Theorem 7, let be the permutation such that . Since is linear, the root node of must have at most one non-leaf child. If no such node exists then trivially has taxa and the tree consisting of nodes with edge such that is labeled by and is labeled by is a -Dollo linear phylogeny for rooted at terminating at .
We now focus on the case where has taxa. Let be the non-leaf child of the root node of , and let be the sink node of the tree constructed by removing all leaves from . Since for all and , it must be the case that either or . To see why observe that would imply that character would has two gain edges within , violating the definition of a -Dollo linear phylogeny. By the same logic, since for all and , it must be the case that either or . However, it is clear that . Therefore, it must be the case that either (i) and , or (ii) and .
We may assume without loss of generality on that this first case holds, namely, that and . If and , then for the permutation such that reverses , tree would be a -Dollo linear phylogeny for such that . Since and , it must be the case that is labeled by and is labeled by .
Given , we will now construct a -Dollo linear phylogeny for matrix rooted at . Let be the leaf of whose parent is , and let be the leaf of whose parent is . Specifically, we define as the subtree of rooted at that excludes the leaf and leaf . We relabel each node of omitting the th and th entries of its original label .
Proof (Lemma 18).
We will prove the above conditions in the order they were proposed.
-
(i)
Since itself is a -Dollo phylogeny and characters can thus only be gained and lost once throughout all of , it holds that and are each pairwise disjoint. Additionally, sets and for all distinct must be disjoint, since no character can be gained in one subtree and lost in another subtree rooted at the same branching point . Hence, must additionally be pairwise disjoint.
-
(ii)
Sets are pairwise disjoint since, by definition, each taxon is observed as a leaf exactly once in , and these subsets are obtaining by a traversal on . Additionally, their union must comprise the set of all taxa in the subtree of rooted at , since is simply a partition of all taxa rooted under .
-
(iii)
It holds that , since previously-lost characters cannot be regained in any where .
-
(iv)
By construction, is a rooted, node-labeled tree. Since this tree is formed precisely by the traversal from node to the first encountered branching point of without traversing any children of , itself has no branching points and is thus linear. Since is rooted at , must be rooted at .
-
(v)
By construction, is a rooted, node-labeled tree. Since is formed precisely by the traversal of along the -th outgoing edge from node labeled by , is rooted at .
Proof (Lemma 20).
This follows directly from Definition 1.
Proof (Lemma 23).
Given binary matrices and , consider some variable character w.r.t. and . Therefore, there is some taxon such that . We distinguish two cases.
Given binary matrices and , consider some character such that contains a in column , that is, there is some taxon such that . We distinguish two cases.
-
1.
If , then . Thus, , so is variable w.r.t. and .
-
2.
If , then . Thus, , so is variable w.r.t. and .
Proof (Theorem 27).
Consider a compact -Dollo phylogeny for matrix rooted at . Let be the subtree of obtained by traversing from root node labeled by to a first branching point with label , and let , and be the gained characters, lost characters and observed taxa, respectively, in tree . Since every character not gained or lost in can never change across the taxa in , sets , , must be in agreement with and . So Condition 1 holds.
Let be the -Dollo matrix decomposition of and on , , . By Lemma 18, then, the tree induced on by taxa set is a -Dollo linear phylogeny for rooted at and terminating at .
We must finally show that there exist -Dollo phylogenies , for , respectively rooted at . However, consider the existing subtrees of obtained by traversing along each of the outgoing edges from indexed by , and let let , and be the gained characters, lost characters and observed taxa, respectively, in each tree . By Lemma 18, these trees are already exactly -Dollo phylogenies for rooted at . So we will show that for every , there is some value of , with existing tree , such that . To do this, we will show outright that .
By Lemma 18, are a partition of characters and are a partition of taxa . Since is a -Dollo phylogeny for , every character is variable w.r.t. and and not variable with respect to and for such that . So by Lemma 23 and the definition of a complement matrix, and are precisely a block diagonal decomposition of .
Additionally, and must be such a block diagonal matrix decomposition of maximum size. We will prove this by contradiction. If and was not on maximum size, it would be the case that for some , that there would exist two submatrices and such that and partition , and partition , all characters were not variable w.r.t. and , and all characters were not variable w.r.t. and . This implies that there exists two subtrees and of such that contains all taxa in and all gain or loss edges of , contains all taxa in and all gain or loss edges of , and and are disjoint from each other. But then, this implies that edge in from to the root node of is not a gain or loss edge for any character . Since is compact, this cannot be the case.
By the definition of a -Dollo decomposition, the sets of taxa and characters defining also comprise a block diagonal decomposition of of maximum size. But the block diagonal matrix decomposition of maximum size for any matrix is unique, so it must be true that . So for every submatrix , there exists a -Dollo phylogeny for rooted at . So Condition 2 holds.
Given binary matrix and binary vector , let there exist and such that (i) , , and are in agreement with and and (ii) the -Dollo matrix decomposition of and on , , yields such that there exists -Dollo linear phylogeny for rooted at and terminating on and -Dollo phylogenies for , respectively, rooted at for .
We will construct . Let node be the node of labeled by , and let for all be the root node of labeled by . Then, add edge for all to the composite of , and subsequently contract all such edges.
Each node of is clearly labeled, and the root of is labeled by , so clearly is rooted at . First, we prove that for every character such that , there must be no gain edge on in . There is no gain edge for in prior to , since has no gain edge for by definition of rooted -Dollo linear phylogeny rooted at . To demonstrate that there is no gain edge for in after , we differentiate two cases:
-
1.
. Then, for all has no gain edge for by definition of rooted -Dollo phylogeny rooted at . So there is no gain edge for in after encountering , either, by construction of .
-
2.
. Then, it must be the case that . But since , has no gain edge for for all . So there is no gain edge for in after encountering , either, by construction of .
Second, we prove that for every character such that , there must be at most one loss edge on in . To demonstrate that there is at most one loss edge for in after encountering , we differentiate two cases:
-
1.
. Then, clearly was not lost in . By the definition of a block diagonal decomposition, we know that must be variable w.r.t. and for at most one value of . So there must be at most one loss edge in .
-
2.
. Then, must have been lost in . But since , has no gain edge, and thus no loss edge, on for all .
Third, we prove that for every character such that , there must be at most one gain and at most one loss edge on in . We prove three statements that together demonstrate this in full:
-
1.
For any character , there cannot be a gain edge in and a gain edge in for . If there was, then . But since , so , cannot have a gain edge, and thus cannot have a loss edge, on for all .
-
2.
For any character , there cannot be a loss edge in and a loss edge in for . If there was, then . But since , cannot have a gain edge, and thus cannot have a loss edge, on for all .
-
3.
For any character , there cannot be a gain edge in both and for distinct such that . This follows from the definition of a block matrix decomposition, since must be variable w.r.t. and for at most one value of .
So is a valid -Dollo phylogeny for , and we are done.
