Advancing Divide-and-Conquer Phylogeny Estimation using Robinson-Foulds Supertrees

One of the Grand Challenges in Science is the construction of the Tree of Life, an evolutionary tree containing several million species, spanning all life on earth. However, the construction of the Tree of Life is enormously computationally challenging, as all the current most accurate methods are either heuristics for NP-hard optimization problems or Bayesian MCMC methods that sample from tree space. One of the most promising approaches for improving scalability and accuracy for phylogeny estimation uses divide-and-conquer: a set of species is divided into overlapping subsets, trees are constructed on the subsets, and then merged together using a “supertree method”. Here, we present Exact-RFS-2, the first polynomial-time algorithm to find an optimal supertree of two trees, using the Robinson-Foulds Supertree (RFS) criterion (a major approach in supertree estimation that is related to maximum likelihood supertrees), and we prove that finding the RFS of three input trees is NP-hard. We also present GreedyRFS (a greedy heuristic that operates by repeatedly using Exact-RFS-2 on pairs of trees, until all the trees are merged into a single supertree). We evaluate Exact-RFS-2 and GreedyRFS, and show that they have better accuracy than the current leading heuristic for RFS. Exact-RFS-2 and GreedyRFS are available in open source form on Github at github.com/yuxilin51/GreedyRFS.


Introduction
Supertree construction (i.e., the combination of a collection of trees, each on a potentially different subset of the species, into a tree on the full set of species) is a natural algorithmic problem that has important applications to computational biology; see [7] for a 2004 book on the subject and [9, 10, 14, 16-18, 27, 40] for some of the recent papers on this subject. Supertree methods are particularly important for large-scale phylogeny estimation, where it can be used as a final step in a divide-and-conquer pipeline [48]: the species set is divided into two or more overlapping subsets, unrooted leaf-labelled trees are constructed (possibly recursively) on each subset, and then these subset trees are combined into a tree on the full dataset, using the selected supertree method. Furthermore, provided that optimal supertrees are computed, divide-and-conquer pipelines can be provably statistically consistent under stochastic models of evolution: i.e., as the amount of input data (e.g., sequence lengths when estimating gene trees, or number of gene trees when estimating species trees) increases, the probability that the true tree is returned converges to 1 [26,47].
Unfortunately, the most accurate supertree methods are typically heuristics for NP-hard optimization problems [4,16,27,29,31,35,36,40], and so are computationally intensive. However, divide-and-conquer strategies, especially recursive ones, may only need to apply supertree methods to two trees at a time, and hence the computational complexity of supertree estimation given two trees is of interest. One optimization problem where optimal supertrees can be found on two trees is the NP-hard Maximum Agreement Supertree (SMAST) problem (also known as the Agreement Supertree Taxon Removal problem), which removes a minimum number of leaves so that the reduced trees have an agreement supertree [14,17]. Similarly, the Maximum Compatible Supertree (SMCT) problem, which removes a minimum number of leaves so that the reduced trees have a compatibility supertree [5,6], can also be solved in polynomial time on two trees (and note that SMAST and SMCT are identical when the input trees are fully resolved). Because SMAST and SMCT remove taxa, methods for these optimization problems are not true supertree methods, because they do not return a tree on the entire set of taxa. However, solutions to SMAST and SMCT could potentially be used as constraints for other supertree methods, where the deleted leaves are added into the computed SMAST or SMCT trees, so as to optimize the desired criterion.
When restricting to methods that return trees on the full set of taxa, much less seems to be understood about finding supertrees on two trees. However, if the two input trees are compatible (i.e., there is a supertree that equals or refines each tree when restricted to the respective leaf set), then finding that compatibility supertree is solvable in polynomial time, using (for example) the well known BUILD algorithm [1], but more efficient algorithms exist (e.g., [3,6]).
Since compatibility is a strong requirement (rarely seen in biological datasets), optimization problems are more relevant. One optimization problem worth discussing is the Maximum Agreement Supertree Edge Contraction problem (which takes as input a set of rooted trees and seeks a minimum number of edges to col-lapse so that an agreement supertree exists). This problem is NP-hard, but the decision problem can be solved in O((2k) p kn 2 ) time when the input has k trees and p is the allowed number of number of edges to be collapsed [14]. Note that this analysis means that their algorithm for AST-EC may be exponential even for two trees, when the number of edges that must be collapsed is Ω(n) (e.g., imagine two caterpillar trees, where one is obtained from the other by moving the left-most leaf to the rightmost position).
In sum, while supertree methods are important and well studied, when restricted to the major optimization problems that do not remove taxa, polynomial time methods do not seem to be available, even for the special case where the input contains just two trees. This restriction has consequences for large-scale phylogeny estimation, as without good supertree methods, divide-and-conquer pipelines are not guaranteed to be statistically consistent, are not fast, and do not have good scalability [47].
In this paper we examine the well known Robinson-Foulds Supertree (RFS) problem [2], which seeks a supertree that minimizes the total Robinson-Foulds [32] distance to the input trees. Although RFS is NP-hard [22], it has several desirable properties: it is closely related to maximum likelihood supertrees [38] and, as shown very recently, has good theoretical performance for species tree estimation in the presence of gene duplication and loss [25]. Because of its importance, there are several methods for RFS supertrees, including PluMiST [18], MulRF [8], and FastRFS [44]. A comparison between FastRFS and other supertree methods (MRL [27], ASTRAL, ASTRID [43], PluMiST, and MulRF) on simulated and biological supertree datasets showed that FastRFS matched or improved on the other methods with respect to topological accuracy and RFS criterion scores [44]. Hence, FastRFS is currently the leading method for the RFS optimization problem.
The main contributions of this paper are: -We prove that RFS is solvable in O(n 2 |X|) time for two trees, where n is the number of leaves and X is the number of shared leaves (Theorem 1) and NP-hard for three or more trees (Lemma 6). -We present Exact-RFS-2, a polynomial time algorithm for the RFS problem when given only two source trees, and explore its performance on simulated data, both within a natural divide-and-conquer pipeline and within a greedy heuristic (Section 3). We show that Exact-RFS-2 outperforms Fas-tRFS [44] on two trees, the current most accurate method for RFS, and that GreedyRFS is better than FastRFS for small to moderate numbers of source trees (Section 4). -We prove that divide-and-conquer pipelines using Exact-RFS-2 are statistically consistent methods for phylogenetic tree estimation (both gene trees and species trees) under standard sequence evolution models (Theorem 2). -We establish equivalence between RFS and some other supertree problems (Lemma 1). -We show critical differences between RFS and SMAST/SMCT problems, that establish that methods for SMAST or SMCT cannot provably be used to constrain the search for RFS supertrees (Lemma 23).
The remainder of the paper is organized as follows. In Section 2, we provide terminology and define the optimization problems we consider. We present the Exact-RFS-2 algorithm and establish theory related to the algorithm in Section 3. Our experimental performance study is presented in Section 4, and we conclude in Section 5 with a discussion of trends and future research directions.
The details of the performance study and commands necessary to reproduce the study are omitted from the main paper but available in the appendices; the proofs are also available in [50]. All datasets used in this study are publicly available from prior studies, and the scripts and codes necessary to reproduce the study are available at http://github.com/yuxilin51/GreedyRFS.

Terminology and Problem Statements
We let [N ] = {1, 2, . . . , N } and A = {T i | i ∈ [N ]} denote the input to a supertree problem, where each T i is a phylogenetic tree on leaf set L(T i ) = S i ⊆ S (where L(t) denotes the leaf set of t) and the output is a tree T where L(T ) is the set of all species that appear as a leaf in at least one tree in A, which we will assume is all of S. We use the standard supertree terminology, and refer to the trees in A as "source trees" and the set A as a "profile".

Robinson-Foulds Supertree
Each edge e in a tree T defines a bipartition π e := [A|B] of the leaf set, and each tree is defined by the set C(T ) := {π e | e ∈ E(T )}. The Robinson-Foulds distance [32] (also called the bipartition distance) between trees T and T with the same leaf set is RF(T, T ) := |C(T )\C(T )| + |C(T )\C(T )|. We extend the definition of RF distance to allow for T and T to have different leaf sets as follows: RF (T, T ) := RF (T | X , T | X ), where X is the shared leaf set and t| X denotes the homeomorphic subtree of t induced by X. Letting T S denote the set of all phylogenetic trees such that L(T ) = S and T B S denote the binary trees in T S , then a Robinson-Foulds supertree [2] of a profile A is a binary tree We let RF(T, A) := i∈[N ] RF(T, T i ) denote the RFS score of T with respect to profile A. Thus, the Robinson-Foulds Supertree problem takes as input the profile A and seeks a Robinson-Foulds (RF) supertree for A, which we denote by RFS(A).

Split Fit Supertree
The Split Fit (SF) Supertree problem was introduced in [49], and is based on optimizing the number of shared splits (i.e., bipartitions) between the supertree and the source trees. For two trees T , T with the same leaf set, the split support is the number of shared bipartitions, i.e., SF(T, T ) := |C(T ) ∩ C(T )|. For trees with different leaf sets, we restrict them to the shared leaf set before calculating the split support. The Split Fit supertree for a profile A of source trees, denoted SFS(A), is a tree T SFS ∈ T B S such that Thus, the split support score of T with respect to A is SF(T, The Split Fit Supertree (SFS) problem takes as input the profile A and seeks a Split Fit supertree (the supertree with the maximum split support score), which we denote by SFS(A).
Nomenclature for variants of RFS and SFS problems -The relaxed versions of the problems where we do not require the output to be binary (i.e., we allow T ∈ T S ) are named Relax-RFS and Relax-SFS. -We append "-N " to the name to indicate that we assume there are N source trees. If no number is specified then the number of source trees is unconstrained. -We append "-B" to the name to indicate that the source trees are required to be binary; hence, we indicate that the source trees are allowed to be non-binary by not appending -B.
Thus, the RFS problem with two binary input trees is RFS-2-B and the relaxed SFS problem with three (not necessarily binary) input trees is Relax-SFS-3.
Other notation For any v ∈ V (T ), we let N T (v) denote the set of neighbors of v in T . A tree T is a refinement of T iff T can be obtained from T by contracting a set of edges. Two bipartitions π 1 and π 2 of the same leaf set are said to be compatible if and only if there exists a tree T such that π i ∈ C(T ), i = 1, 2. A bipartition π = [A|B] restricted to a subset R is π| R = [A ∩ R|B ∩ R]. For a graph G and a set F of vertices or edges, we use G + F to represent the graph obtained from adding the set F of vertices or edges to G, and G − F is defined for deletions, similarly.

Theoretical Results
In this section we establish the main theoretical results, with detailed proofs provided in [50] or in the appendix of the full version on bioRxiv.

Solving RFS and SFS on two trees
Lemma 1. Given an input set A of source trees, a tree T ∈ T B S is an optimal solution for RFS(A) if and only if it is an optimal solution for SFS(A).
The main result of this paper is Theorem 1 (correctness is proved later within the main body of the paper, and the running time is established in the Appendix): Fig. 1: T 1 and T 2 depicted in (a) and (b) have an overlapping leaf set X = {l 1 , l 2 , . . . , l 7 }. Each of a 1 , . . . , a 6 and b 1 , . . . , b 6 can represent a multi-leaf extra subtree. Using indices to represent the shared leaves, let π = [12|34567]; then e 1 (π) = e and e 2 (π) = e . P (e) is the path from v 1 to v 4 , so w(e) = 3. The input to Exact-RFS-2 is a pair of binary trees T 1 and T 2 . Let X denote the set of shared leaves. At a high level, Exact-RFS-2 constructs a tree T init that has a central node that is adjacent to every leaf in X and to the root of every "rooted extra subtree" (a term we define below) so that T init contains all the leaves in S. It then modifies T init by repeatedly refining it to add specific desired bipartitions, to produce an optimal Split Fit (and optimal Robinson-Foulds) supertree ( Figure 3). The bipartitions that are added are defined by a maximum independent set in a bipartite "weighted incompatibility graph" we compute.
Additional notation Let Π = 2 X denote the set of all bipartitions of X; any bipartition that splits a single leaf from the remaining |X|−1 leaves will be called "trivial" and the others will be called "non-trivial". Let C(T 1 , T 2 , X) denote C(T 1 | X ) ∪ C(T 2 | X ), and let Triv and NonTriv denote the sets of trivial and non-trivial bipartitions in C(T 1 , T 2 , X), respectively. We refer to T i | X , i = 1, 2 as backbone trees ( Figure 2). Recall that we suppress degree-two vertices when restricting a tree T i to a subset X of the leaves; hence, every edge e in T i | X will correspond to an edge or a path in T (see Fig. 1 for an example). We will let P (e) denote the path associated to edge e, and let w(e) := |P (e)| (the number of edges in P (e)). Finally, for π ∈ C(T i | X ), we define e i (π) to be the edge that induces π in T i | X (Fig. 1). The next concept we introduce is the set of extra subtrees, which are rooted subtrees of T 1 and T 2 , formed by deleting X and all the edges and vertices on paths between vertices in X (i.e., we delete T i | X from T i ). Each component in T i − T i | X is called an extra subtree of T i , and note that the extra subtree t is naturally seen as rooted at the unique vertex r(t) that is adjacent to a vertex in (c) incompatibility graph π 1 π 2 π 3 π 4 π 5 π 6 π 7 π 8 Fig. 2: We show (a) T 1 | X , (b) T 2 | X , and (c) their incompatibility graph, based on the trees T 1 and T 2 in Figure 1 (without the trivial bipartitions). Each π i is the bipartition induced by e i , and the weights for π 1 , . . . , π 8 are 3, 4, 1, 1, 2, 2, 2, 3, in that order. We note that π 1 and π 5 are the same bipartition, but they have different weights as they are induced by different edges; similarly for π 3 and π 7 . The maximum weight independent set in this graph has all the isolated vertices (π 1 , π 3 , π 5 , π 7 ) and also π 2 , π 8 , and so has total weight 15.
We can now define the initial tree T init computed by Exact-RFS-2: T init has a center node that is adjacent to every x ∈ X and also to the root r(t) for every extra subtree t ∈ Extra(T 1 ) ∪ Extra(T 2 ). Note that T init has a leaf for every element in S, and that T init | Si is a contraction of T i , formed by collapsing all the edges in the backbone tree T i | X .
We say that an extra subtree t is attached to edge e ∈ E(T i | X ) if the root of t is adjacent to an internal node of P (e), and we let T R(e) denote the set of such extra subtrees attached to edge e. Similarly, if π ∈ C(T 1 , T 2 , X), we let T R * (π) refer to the set of extra subtrees that attach to edges in a backbone tree that induce π in either T 1 | X or T 2 | X . For example, if both trees T 1 and T 2 contribute extra subtrees to π, then T R * (π) := i∈ [2] T R(e i (π)).
For any Q ⊆ X, we let BP i (Q) denote the set of bipartitions in C(T i | X ) that have one side being a strict subset of Q, and we let T RS i (Q) denote the set of extra subtrees associated with these bipartitions. In other words, BP i (Q) := {[A|B] ∈ C(T i | X ) | A Q or B Q}, and T RS i (Q) := π∈BPi(Q) T R(e i (π)). Intuitively, T RS i (Q) denotes the set of extra subtrees in T i that are "on the side of Q". By Corollary 2, which appears in the Appendix, for any π = [A|B] ∈ C(T i | X ), BP i (A)∪BP i (B) is the set of bipartitions in C(T i | X ) that are compatible with π. Finally, let BP(Q) = BP 1 (Q)∪BP 2 (Q), and T RS(Q) = T RS 1 (Q)∪ T RS 2 (Q). We give an example for these terms in Figure 1.
The incompatibility graph of a set of trees, each on the same set of leaves, has one vertex for each bipartition in any tree (and note that bipartitions can appear more than once) and edges between bipartitions if they are incompatible (see [30]). We compute a weighted incompatibility graph for the pair of trees T 1 | X and T 2 | X , in which the weight of the vertex corresponding to bipartition π appearing in tree T i | X is w(e i (π)), as defined previously. Thus, if a bipartition is common to the two trees, it produces two vertices in the weighted incompatibility graph, and each vertex has its own weight ( Figure 2).

4:
compute T R(ei(π)), w(ei(π)) 5: compute BP(A), BP(B), T RS(A), T RS(B), and T R * (π), 6: construct T as a star tree with leaf set X and center vertexv and with the root of each t ∈ Extra(T1) ∪ Extra(T2) connected tov by an edge let Tinit = T 7: construct the weighted incompatibility graph G of T1| X and T2| X 8: compute the maximum weight independent set I * in G 9: let I be the set of bipartitions associated with vertices in I * 10: for each π = [{a}|B] ∈ Triv do 11: detach all extra subtrees in T R * (π) fromv and attach them onto (v, a) such that T R(e1(π)) are attached first with their ordering matching their attachments on e1(π) and T R(e2(π)) are attached to the right of all subtrees in T R(e1(π)) with the ordering of them also matching their attachments on e2(π) letT = T after for loop 12: H(v) = NonTriv, set sv(π) =v for all π ∈ NonTriv 13: for each π ∈ NonTriv ∩ I (in any order) do 14: T ← Refine(T, π, H, sv) let T * = T after for loop 15: arbitrarily refine T to make it a binary tree 16: return T Π X is the set of bipartitions from the input trees that are induced by edges in the minimal subtree of T 1 or T 2 spanning X, and Π Y are all the other input tree bipartitions. We define p X (·) and p Y (·) on trees T ∈ T S by: Note that p X (T ) and p Y (T ) decompose the split support score of T into the score contributed by bipartitions in Π X and the score contributed by bipartitions in Π Y ; thus, the split support score of T with respect to T 1 , T 2 is p X (T ) + p Y (T ). As we will show, the two scores can be maximized independently and we can use this observation to refine T init so that it achieves the optimal total score.
Overview of Exact-RFS-2 Exact-RFS-2 (Algorithm 1) has four phases. In the pre-processing phase (lines 1-5), it computes the weight function w and the mappings T R, T R * , BP, and T RS for use in latter parts of Algorithm 1 and Algorithm 2. In the initial construction phase (line 6), it constructs a tree T init (as described earlier), and we note that T init maximizes p Y (·) score (Lemma 2). In the refinement phase (lines 7-14), it refines T init so that it attains the maximum p X (·) score. In the last phase (line 15), it arbitrarily refines T to make it binary. The refinement phase begins with the construction of a weighted incompatibility graph G of T 1 | X and T 2 | X (see Figure 2). It then finds a maximum weight independent set of G that defines a set I ⊆ C(T 1 , T 2 , X) of compatible bipartitions of X. Finally, it uses these bipartitions of X in I to refine T init to achieve the optimal p X (·) score, by repeatedly applying Algorithm 2 for each π ∈ I (and we note that the order does not matter). See Figure 3 for an example of Exact-RFS-2 given two input source trees.
(a) Tinit: star with leaf set X and all extra subtrees attached to center (b)T : after adding all Triv to T |X Algorithm 2 refines the given tree T on leaf set S with bipartitions on X from C(T 1 , T 2 , X) \ C(T | X ). Given bipartition π = [A|B] on X, Algorithm 2 produces a refinement T of T such that C(

Algorithm 2 Refine
Input: a tree T on leaf set S, a nontrivial bipartition π = [A|B] of X, two data structures H and sv Output: a tree T which is a refinement of T such that for both i = 1, 2, are attached first with their ordering matching their attachments on e1(π) and T R(e2(π)) are attached to the right of all subtrees in T R(e1(π)) with the ordering of them also matching their attachments on e2(π) if t is attached to v, detach it and attach to va 10: for each t ∈ T RS(B) do 11: if t is attached to v, detach it and attach to v b 12: for each remaining extra subtree attached to v do 13: detach it from v and attach it to either va or v b To do this, we first find the unique vertex v such that no component of T − v has leaves from both A and B. We create two new vertices v a and v b with an edge between them. We divide the neighbor set of v into three sets: N A is the set of neighbors that split v from leaves in A, N B is the set of neighbors that split v from leaves in B, and N other contains the remaining neighbors. Then, we make vertices in N A adjacent to v a and vertices in N B adjacent to v b . We note that N other = ∅ if X = S and thus there are no extra subtrees. In the case where X = S, N other contains the roots of the extra subtrees adjacent to v and we handle them in four different cases to make T include the desired bipartitions: those vertices that root extra subtrees in T R * (π) are moved onto the edge (v a , v b ) (by subdividing the edge to create new vertices, and then making these vertices adjacent to the new vertices) those vertices that root extra subtrees in T RS(A) are made adjacent to v a those that root extra subtrees in T RS(B) are made adjacent to v b the remaining vertices can be made adjacent to either v a or v b Algorithms 1 and 2 also use two data structures (functions) H and sv: Lemma 2 formally states that the tree T init we build in line 6 of Exact-RFS-2 (Algorithm 1) maximizes the p Y (·) score. This lemma is true because there are only |Π Y | bipartitions that can contribute to p Y (·) and T init contains all of them by construction. We define the function w * : Π → N ≥0 as follows: i∈ [2] w(e i (π)) else.
Lemma 3 shows that w * (π) represents the maximum potential increase in p X (·) as a result of adding bipartition π to T | X . The proof of Lemma 3 follows the idea that for any bipartition π of X, there are at most w * (π) edges in either T 1 or T 2 whose induced bipartitions become π when restricted to X. Therefore, by only adding π to T | X , at most w * (π) more bipartitions get included in C(T | S1 ) or C(T | S2 ) so that they contribute to the increase of p X (T ). The proof of Lemma 4 uses Lemma 3 repeatedly by adding the compatible bipartitions to the tree in an arbitrary order. Proposition 1. LetT be the tree constructed after line 11 of Algorithm 1, then p X (T ) = w * (Triv).
The proof naturally follows by construction (Line 8 of Algorithm 1), and implies that the algorithm adds the trivial bipartitions of X (which are all in I) to T | X so that p X (T ) reaches the full potential of adding those trivial bipartitions.
Lemma 5. Let T be a supertree computed within Algorithm 1 at line 14 immediately before a refinement step. Let π = [A|B] ∈ NonTriv ∩ I. Let T be a refinement of T obtained from running Algorithm 2 with supertree T , bipartition π, and the auxiliary data structures H and sv.
The idea for the proof of Lemma 5 is that for any non-trivial bipartition π ∈ I of X to be added to T | X , Algorithm 2 is able to split the vertex correctly and move extra subtrees around in a way such that each bipartition in T 1 or T 2 that is induced by an edge in P (e 1 (π)) or P (e 2 (π)), which is not in T | S1 or T | S2 before the refinement, becomes present in T | S1 or T | S2 after the refinement. Since there are exactly w * (π) such bipartitions, they increase p X (·) by w * (π).
Proposition 2. Let G be the weighted incompatibility graph on T 1 | X and T 2 | X , and let I be the set of bipartitions associated with vertices in I * , which is a maximum weight independent set of G. Let F be any compatible subset of C(T 1 , T 2 , X).
We now restate and prove Theorem 1: Proof. First we claim that p X (T * ) ≥ p X (T ) for any tree T ∈ T S , where T * is defined as from line 14 of Algorithm 1. Fix arbitrary T ∈ T S and let F = C(T | X ). Then by Lemma 4, p X (T ) ≤ w * (F ). We know that w * (π) = 0 for any (NonTriv ∩ I) ∪ Triv. Therefore, by Proposition 1 and Lemma 5, we have From Lemma 2 and the fact that a refinement of a tree never decreases p X (·) and p Y (·), we also know that p Y (T * ) ≥ p Y (T init ) ≥ p Y (T ) for any tree T ∈ T S . Since for any T ∈ T S , SF(T, A) = p X (T ) + p Y (T ), T * achieves the maximum split support score with respect to A among all trees in T S . Thus, T * is a solution to Relax-SFS-2-B (Corollary 1). If T * is not binary, Algorithm 1 arbitrarily resolves every high degree node in T * until it is a binary tree and then returns a tree that achieves the maximum split support score among all binary trees of leaf set S. See the Appendix for the running time analysis.

DACTAL-Exact-RFS-2
Let Φ be a model of evolution (e.g., GTR) for which statistically consistent methods exist, and we have some data (e.g., sequences) generated by the model and wish to construct a tree. We construct an initial estimate of the tree, divide the dataset into two overlapping subsets (by removing an edge e, letting X be the set of the p nearest leaves to e, and letting the subsets be A ∪ X and B ∪ X, where π e = [A|B]), re-estimate trees on the subsets (perhaps using a recursive approach that is statistically consistent), and then combine the trees together using Exact-RFS-2. We call this the DACTAL-Exact-RFS-2 pipeline, due to its similarity to the DACTAL pipeline [26].
Before we prove that DACTAL-Exact-RFS-2 can enable statistically consistent pipelines, we begin with some definitions. Given a tree T and an internal edge e in T , the deletion of the edge e and its endpoints defines four subtrees. A short quartet around e is a set of four leaves, one from each subtree, selected to be closest to the edge. Note that due to ties, there can be multiple short quartets around some edges. The set of short quartets for a tree T is the set of all short quartets around the edges of T . The short quartet trees of T is the set of quartet trees on short quartets induced by T . It is well known that the short quartet trees of a tree T define T , and furthermore T can be computed from this set in polynomial time [11][12][13]. Proof. Because T A and T B are true trees, it follows that T is a compatibility supertree for T A and T B . Furthermore, because every short quartet tree appears in at least one of these trees, T is the unique compatibility supertree for T A and T B (by results from [12,13], mentioned above). Finally, because T is a compatibility supertree, the RFS score of T with respect to T A , T B is 0, which is the best possible. Since Exact-2-RFS solves the RFS problem on two binary trees, Exact-2-RFS returns T given input T A and T B .
Thus, Exact-2-RFS is guaranteed to return the true tree when given two correct trees that have sufficient overlap (in that all short quartets are included). We continue with proving that these pipelines are statistically consistent.
Theorem 2. The DACTAL-Exact-RFS-2 pipeline is statistically consistent under any model Φ for which statistically consistent tree estimation methods exist.
Proof. Let Φ be the sequence evolution model. To establish statistical consistency of the pipeline, we need to prove that as the amount of data increases the tree that is returned by the pipeline converge to the true tree. Hence, let F be the method used to compute the starting tree and let G be the method used to compute the subset trees. Because F is statistically consistent under Φ, it follows that as the amount of data increases, the starting tree computed by F will converge to the true tree T . Now consider the decomposition into two sets produced by the algorithm, when applied to the true tree. Let e be the edge that is deleted and let the four subtrees around e have leaf sets A 1 , A 2 , B 1 , and B 2 . Note in particular that all the leaves appearing in any short quartet around e are placed in the set X. Now, subset trees are computed using G on A 1 ∪ A 2 ∪ X and B 1 ∪ B 2 ∪ X, which we will refer to as T A and T B , respectively. Since G is statistically consistent, as the amount of data increases, T A converges to the true tree on its leaf set (T | L(T A ) ) and T B converges to the true tree on its leaf set (T | L(T B ) ). When T A and T B are equal to the true trees on their leaf sets, then every short quartet tree of T is in T A or T B , and by Lemma 7, T is the only compatibility supertree for T A and T B and Exact-2-RFS(T A , T B ) returns T .
Hence, DACTAL+Exact-2-RFS is statistically consistent under all standard molecular sequence evolution models and also under the MSC+GTR model [33,45] where gene trees evolve within species trees under the multi-species coalescent model (which addresses gene tree discordance due to incomplete lineage sorting [19]) and then sequences evolve down each gene tree under the GTR model.
Note that all that is needed for F and G to guarantee that the pipeline is statistically consistent is that they should be statistically consistent under Φ. However, for the sake of improving empirical performance, F should be fast so that it can run on the full dataset but G can be more freely chosen, since it will only be run on smaller datasets. Indeed, the user can specify the size of the subsets that are analyzed, with smaller datasets enabling the use of more computationally intensive methods. For

Experiments and Results
We performed two experiments: Experiment 1, where we used Exact-2-RFS within a divide-and-conquer strategy for large scale phylogenomic species tree estimation where gene trees differ from the species tree due to Incomplete Lineage Sorting (ILS), and Experiment 2, where we used Exact-2-RFS as part of a greedy heuristic, GreedyRFS, for large-scale supertree estimation.

Experiment 1: Phylogenomic species tree estimation
In this experiment, the input is a set of gene trees that can differ from the species tree due to Incomplete Lineage Sorting [19], ASTRAL [23, 24, 51] is used to construct species trees on the two overlapping subsets in the DACTAL pipeline described above, and the two overlapping estimated species trees are then merged together using either Exact-2-RFS or FastRFS. Because the divide-and-conquer strategy produces two source trees, the RFS criterion score for Exact-2-RFS cannot be worse than the score obtained by FastRFS; here we examine the degree of improvement. The simulation protocol produced datasets with high variability (especially for small numbers of genes), so that there was substantial range in the optimal criterion scores for 25 and 100 genes ( Figure 4). On average, Exact-2-RFS produces better RFS scores than FastRFS for all numbers of genes, and strictly dominates FastRFS for 1000 genes (Fig. 4), showing that divide-andconquer pipelines are improved using Exact-2-RFS compared to FastRFS.

Experiment 2: Exploring GreedyRFS for supertree estimation
We developed GreedyRFS, a greedy heuristic that takes a profile A as input, and then merges pairs of trees until all the trees are merged into a single tree. The choice of which pair to merge follows the technique used in SuperFine [41] for computing the Strict Consensus Merger, which selects the pair that maximizes the number of shared taxa between the two trees (other techniques could be used, potentially with better accuracy [15]). Thus, GreedyRFS is identical to Exact-2-RFS when the profile has only two trees.
We use a subset of the SMIDgen [39] datasets with 500 species and varying numbers of source trees (each estimated using maximum likelihood heuristics) that have been used to evaluate supertree methods in several studies [27,[39][40][41]44]. See Appendix (in the full version of the paper on arXiv) for full details of this study.
We explored the impact of changing the number of source trees. The result for two source trees is predicted by theory (i.e., GreedyRFS is the same as Exact-2-RFS for two source trees, and so is guaranteed optimal for this case), but even when the number of source trees was greater than two, GreedyRFS dominated FastRFS in terms of criterion score, provided that the number of source trees was not too large (Fig. 5).
This establishes that the advantage in criterion score is not limited to the case of two source trees, suggesting that using Exact-2-RFS within GreedyRFS (or some other heuristics) may be useful for supertree estimation more generally.

Conclusions
The main contribution of this paper is Exact-2-RFS, a polynomial time algorithm for the Robinson-Foulds Supertree (RFS) of two trees that enables divide-andconquer pipelines to be provably statistically consistent under sequence evolution models (e.g., GTR [42] and MSC+GTR [33]). Our experimental study showed that Exact-2-RFS dominates the leading RFS heuristic, FastRFS, when used within divide-and-conquer species tree estimation using genome-scale datasets, a problem of increasing importance in biology. We also showed that a greedy heuristic using Exact-2-RFS produced better criterion scores than FastRFS when the number of source trees was small to moderate, showing the potential for Exact-2-RFS to be useful in other settings. Overall, our study advances the theoretical understanding of several important supertree problems and also provides a new method that should improve scalability of phylogeny estimation methods.
This study suggests several directions for future work. For example, although we showed that Exact-2-RFS produced better RFS criterion scores than Fas-tRFS when used in divide-and-conquer species tree estimation (and similarly GreedyRFS was better than FastRFS for small numbers of source trees in supertree estimation), additional studies are needed to explore its performance, including additional datasets (both simulated and biological datasets) and other leading supertree methods. Similarly, other heuristics using Exact-2-RFS besides GreedyRFS should be developed and studied.
the Ira and Debra Cohen Fellowship in Computer Science at the University of Illinois to SAC and EKM. This research is part of the Blue Waters sustainedpetascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois.

Notation for Exact-RFS-2
We assume T 1 and T 2 are two binary trees with leaf set S 1 and S 2 .
-X = S 1 ∩ S 2 : the shared leaf set -S = S 1 ∪ S 2 : the union of leaf set -Π = 2 X : the set of all bipartitions of X -C(T 1 , T 2 , X) := C(T 1 | X ) ∪ C(T 2 | X ): the set of bipartitions of X in the backbone trees of T 1 and T 2 -P (e): the path in T i from which a backbone edge e ∈ T i | X is obtained by suppressing degree-two vertices w(e) = |P (e)|: the weight of edge e ∈ E(T i | X ) e i (π): the edge in T i that induces the bipartition π ∈ C(T 1 , T 2 , X) w * (π) := i∈ [2] w(e i (π)): the weight of bipartition π ∈ Π, which is the sum of weights of edges in T i | X that induces π -T i − T i | X : the subgraph obtained by deleting all vertices and edges of the subgraph of T i induced on X -Extra(T i ) := {t | t is a component in T i − T i | X }: the set of extra subtrees of T i r(t): the root of extra subtree t, which is the unique vertex in V (t) that is adjacent to a vertex in the backbone tree -T R(e): the set of extra subtrees attached to e ∈ T i | X , i.e., the set of extra subtrees whose roots are adjacent to internal vertices of P (e) -T R * (π) := i∈ [2] T R(e i (π)): the set of extra subtrees that are attached to

General Theorems and Lemmas on Trees and Bipartitions
The following theorem and corollary gives alternative characterizations of compatibility between two bipartitions. We now provide a lemma and corollary that formalize the relationship between two distinct, yet closely related entities: bipartitions from a tree on leaf set R ⊆ S and bipartitions restricted to R from a tree on leaf set S.

Lemma 8. Let T ∈ T S and let π = [A|B] ∈ C(T ) be a bipartition induced by
∈ P (e ) for any e ∈ E(T | R ). 2. If R∩A = ∅ and R∩B = ∅, then for any π ∈ C(T | R ) induced by e ∈ E(T | R ), π| R = π if and only if e ∈ P (e ).
Proof. Let T R be the minimal subtree of T that spans R. It follows that the leaf set of T R is R and T | R is obtained from T R by suppressing all degree-two vertices.
(Proof of 1) We first claim that if R ∩ A = ∅ or R ∩ B = ∅, then e / ∈ E(T R ). Assume by way of contradiction that e ∈ E(T R ). There are then two non-empty components in T R − e. Since e induces [A|B] in T , the two components in T R − e have leaf set R ∩ A and R ∩ B, which contradicts the fact that one intersection is empty. Therefore, e / ∈ E(T R ). Furthermore, every edge e ∈ E(T | R ) comes from a path in T R . Since e / ∈ E(T R ), then e / ∈ P (e ) for any e ∈ E(T | R ). (Proof of 2) If R ∩ A = ∅ and R ∩ B = ∅, then e is required to connect R ∩ A with R ∩B in T (since e connects A with B). Thus, e is in any subtree of T spanning R; in particular, e ∈ E(T R ). Fix any π ∈ C(T | R ) induced by e ∈ E(T | R ). Note that the bipartition induced by P (e ) in T R equals the bipartition induced by e in T | R , i.e., π . For one direction of the proof, suppose e ∈ P (e ). Because internal vertices of P (e ) in T R do not connect to any leaves, the bipartition induced by the path P (e ) in T R equals the bipartition induced by any of its edges (in particular, e). Since e induces [A|B] On the other hand, if π| R = π , then π induces [R ∩ A|R ∩ B] in T | R . It follows that P (e ) also induces [R ∩ A|R ∩ B] in T R . Suppose e ∈ P (e * ) for some edge e * ∈ E(T | R ) such that e * = e . Then, by the previous argument, π e * = [R ∩ A|R ∩ B], which contradicts the fact that e * and e are different edges. Therefore, e ∈ P (e ).
The next corollary follows easily from Lemma 8.
Corollary 3. Let T be a tree with leaf set S and let π = [A|B] ∈ C(T ) be a bipartition induced by e ∈ E(T ). Let R ⊆ S such that R ∩ A = ∅ and R ∩ B = ∅. Then π| R ∈ C(T | R ).
In the following lemma, we characterize the vertex that we can split to add a compatible bipartition into a tree. An example can be seen in Figure 6.  Lemma 9. Let T be a tree with leaf set S. Let π = [A|B] be a bipartition of S such that π / ∈ C(T ), but π is compatible with C(T ). Then there exists a unique vertex v ∈ V (T ) such that no component of T − v has leaves from both A and B. Furthermore, we can split the neighbors of v, N T (v), into two sets N A and N B , where N A contains neighbors whose corresponding components contain a leaf in A and N B contains neighbors whose corresponding components contain a leaf from B. By replacing v with two vertices v a and v b , making v a adjacent to all the vertices in N A and v b adjacent to all the vertices in N B , and then adding an edge between v a and v b , we create a tree T such that C(T ) = C(T ) ∪ {π}.
Proof. By definition of compatibility, there exists a tree T such that To obtain T from T , we can replace v by two new vertices v a , v b with an edge between them. We also connect all vertices in N A to v a and all vertices in N B to v b . Then it is easy to see that (v a , v b ) induces π in T .

Proofs for Section 3
Lemma 1. Given an input set A of source trees, a tree T ∈ T B S is an optimal solution for RFS(A) if and only if it is an optimal solution for SFS(A).
Proof. Let N ≥ 2 be any integer. Let T 1 , T 2 , . . . , T N and S 1 , S 2 , . . . , S N be defined as from problem statement of RFS. Let T be any binary tree of leaf set S. Then T | Si is also binary and thus |C(T | Si )| = 2|S i | − 3. For any i ∈ [N ], we have Taking the sum of the equations, we have which is a constant. Therefore, for any binary tree T and any profile A of source trees, the sum of T 's RFS score and twice T 's split support score is the same, independent of T . This implies that minimizing the RFS score is the same as maximizing the split support score. Although this argument depends on the output tree being binary, it does not depend on the input trees being binary. Hence, we conclude that RFS and SFS have the same set of optimal supertrees. Lemma 2. For any tree T ∈ T S , p Y (T ) ≤ |Π Y |. In particular, let T init be the tree defined in line 6 of Algorithm 1. Then, p Y (T init ) = |Π Y |.
Let T init be the tree defined in Algorithm 1. We have the following proposition about p X (T init ), which is needed for the proof of Proposition 1.
Proof. For each v ∈ X, consider the bipartition π v = [{v} | S\{v}] of T init induced by the edge that connects the leaf v to the centerv. It is easy to see that π v | Si = [{v} | S i \{v}] ∈ C(T i ) for any i ∈ [2] as π v | Si is a trivial bipartition of S i . By construction, we also have π v | Si ∈ C(T init | Si ). We also know π v | Si ∈ Π X as both sides of π v have non-empty intersections with X. Thus, π v | Si ∈ C(T init | Si ) ∩ C(T i ) ∩ Π X for any i ∈ [2]. So for each v ∈ X, π v | S1 and π v | S2 each contributes 1 to p X (T init ). Therefore, p X (T init ) ≥ 2|X|.
Fix any bipartition π = [A|B] induced by any other edge e ∈ E(T init | Si ) for any i ∈ [2]. By construction of T init , e must be an edge in an extra subtree or connecting an extra subtree to the centerv, i.e., one component in T − e does contain any leaf of X. Therefore, either A ⊆ S\X or B ⊆ S\X, which implies π| Si / ∈ Π X for any i ∈ [2]. Hence, there is no other bipartition of T init such that when restrict to S i contributes to p X (T init ). Therefore, p X (T init ) = 2|X|. Proposition 1. LetT be the tree constructed after line 11 of Algorithm 1, then p X (T ) = w * (Triv).
Proof. Let π = [{a}|B] be a trivial bipartition of X. We know both e 1 (π) and e 2 (π) exist, and we abbreviate them with e 1 and e 2 . We number the extra subtrees in T R(e 1 ) as t 1 , t 2 , . . . , t p , where p = w(e 1 ) − 1, such that t 1 is the closest to a in T 1 . Similarly, we number extra subtrees in T R(e 2 ) as t 1 , t 2 , . . . , t q , where q = w(e 2 ) − 1, such that t 1 is the closest to a in T 2 . For each k ∈ [w(e 1 )], we define and for each k ∈ [w(e 2 )], we define It follows by definition that π k for any k ∈ [w(e 1 )] is the bipartition induced by the kth edge on P (e 1 ) in T 1 , where the edges are numbered starting from the side of a. This implies π k ∈ C(T 1 ) for any k ∈ [w(e 1 )]. Similarly, π k ∈ C(T 2 ) for any k ∈ [w(e 2 )]. In particular, we notice that π 1 = [{a}|S 1 \{a}] and π 1 = [{a}|S 2 \{a}]. Clearly, all these bipartitions (π k and π k for any k) are in Π X because both sides have none empty intersection with X.
Recall that Algorithm 1 moves all extra subtrees in T R * (π) onto the edge (v, a) and orders them in a way to add our desired bipartitions. In particular, the extra subtrees are ordered such that subtrees from T R(e 1 ) and subtrees from T R(e 2 ) are side by side and the attachments of T R(e i ) match their attachment on e i exactly (i.e., t 1 or t 1 , respectively, is closest to a). It is easy to see that as a result of such ordering of the extra subtrees, we have π k ∈ C(T | S1 ) for any k ∈ [w(e 1 )] and π k ∈ C(T | S2 ) for any k ∈ [w(e 2 )], where T is the tree obtained after adding π to the backbone through line 8 of Algorithm 1. Therefore, the algorithm increases |C(T | S1 ) ∩ C(T 1 | X ) ∩ Π X | by w(e 1 ) − 1, because π k / ∈ C(T | S1 ) before the step for all k ∈ [w(e 1 )] except k = 1 (since π 1 = [{a}|S 1 \{a}] ∈ C(T | S1 )). Similarly, the algorithm increases |C(T | S2 )∩C(T 2 | X )∩Π X | by w(e 2 )−1. Overall p X (T ) is increased by w(e 1 ) + w(e 2 ) − 2 = w * (π) − 2 by running one execution of line 8 in Algorithm 1 on T and π.
It is easy to see that line 8 of Algorithm 1 never destroys bipartitions of S 1 or S 2 already in T , so we have Lemma 10 proves that the auxiliary data structures of Algorithm 1 and 2 are maintaining the desired information and that the algorithm can split the vertex and perform the detaching and reattaching of the extra subtrees correctly. These invariants are important to the proof of Lemma 5.
Lemma 10. At any stage of the Algorithm 1 after line 12, we have the following invariants of T and the auxiliary data structures H and sv: 1. For any bipartition π ∈ NonTriv, sv(π) is the vertex to split to add π to C(T | X ). For any internal vertex v, the set of bipartitions H(v) ⊆ NonTriv is the set of bipartitions which can be added to C(T | X ) by splitting v. Proof. We prove the invariants by induction on the number of refinement steps k performed on T . When k = 0, we have T =T and T | X is a star with leaf set X and center vertexv. Thus all bipartitions in NonTriv are compatible with C(T | X ). For any π ∈ NonTriv,v is the vertex to refine in T | X to add π to C(T | X ). Therefore, it is correct that sv(π) =v for every π ∈ NonTriv and H(v) = NonTriv. The roots of all extra subtrees in T R * (π) for any π ∈ NonTriv are all neighbors ofv, so invariant 2 also holds. For any π ∈ C(T | X ) = Triv, let π = [{a}|B]. It is easy to see that since a is a leaf, T RS i ({a}) = ∅ and T RS i (B) = Extra(T i )\T R(e i (π)) for both i ∈ Assume that all invariants hold after any k < k steps of refinement. Let π = [A|B] be the bipartition to add in the kth refinement step. We will show that after the kth refinement step, i.e., one execution of Algorithm 2, the invariants still hold for the resulting tree T . Since v = sv(π) at the beginning of Algorithm 2, π can be added to C(T | X ) by splitting v. By Lemma 9, there exists a division of neighbors of v in T | X into N A ∪N B such that N A (or N B respectively) consists of neighbors of v which can reach vertices of A (or B) but not B (or A) in T | X −v. Then, the algorithm correctly finds N A and N B and connects N A to v a and N B to v b so the new edge (v a , v b ) induces the bipartition π = [A|B] in T | X . For any vertex u other than v and any bipartition π ∈ H(u), the invariants 1 and 2 still hold after Algorithm 2 as we do not change H(u), sv(π ), or the extra subtrees attached to u. For any bipartition π =∈ H(v) such that π = π, if π is not compatible with π, then it cannot be added to C(T | X ) since π is added, so the algorithm correctly discards π and does not add it to H(v a ) or H(v b ). If π is compatible with π, we will show that the invariants 1 and 2 still hold for π .
Fix any π = [A |B ] ∈ H(v) s.t. π = π and π is compatible with π. By Corollary 2, one of A and B is a subset of one side of [A|B]. Assume without loss of generality that A ⊆ A (other cases are symmetric). Then we have B ⊆ B . In this case, Algorithm 2 adds π to H(v a ) and set sv(π) = v a . We will show that this step preserves the invariants. Since π ∈ H(v), before adding π we can split v to add π to C(T | X ). Then there exists a division of neighbors of v in T | X into N A and N B such that N A (or N B , respectively) consists of neighbors of v which can reach vertices of gives an division of neighbors of v a such that N A are the neighbors that can reach leaves of A in T | X − v a and N B \N B ∪ {v b } are the neighbors that can reach leaves of B in T | X −v a . Such a division proves that v a is the correct vertex to refine in T | X to add π to C(T | X ) after the kth refinement. Therefore, invariant 1 holds with respect to π . Since π ∈ H(v) before adding π, we also have for all t ∈ T R * (π ), the root of t is a neighbor of v before adding π. Since A ⊆ A, π ∈ BP(A) and thus T R * (π) ⊆ T RS(A). Then, Algorithm 2 correctly attaches roots of all trees in T R * (π ) to v a . Therefore invariant 2 holds for π .
We have shown that invariants 1 and 2 hold for the tree T with the auxiliary data structures H and sv. Next, we show that invariant 3 holds. Since π is the only bipartition in C(T | X ) that is not in C(T | X ), we only need to show two things: i) for any π ∈ C(T | X ), the invariant 3 still holds, ii) invariant 3 holds for π. We first show i). Fix π = [A |B ] ∈ C(T | X ). Since π is compatible with π , by Corollary 2, one of A and B is a subset of one of A and B. We assume without loss of generality that A ⊆ A. Therefore, B ⊆ B . Let contains v and C(B ) contains u. Since t / ∈ T R * (π ), it cannot be attached to (v, u). Also by the invariant 3 with respect to π , t is not attached any vertex or edge in C(B ). Since this is true for every neighbor of v in N B , t / ∈ C(B) as C(B) consists of only edges connecting v to a neighbor u ∈ N B and the component containing u. Since t was not attached to v before the refinement, t is not attached to (v a , v b ) or C(B) after the refinement, then t must be attached to some edge or vertex in C(A). This proves invariant 3(a) for π and thus the inductive proof.
Lemma 5. Let T be a supertree computed within Algorithm 1 at line 14 immediately before a refinement step. Let π = [A|B] ∈ NonTriv ∩ I. Let T be a refinement of T obtained from running Algorithm 2 with supertree T , bipartition π, and the auxiliary data structures H and sv. Then, p X (T ) − p X (T ) = w * (π).
Proof. Since I corresponds to an independent set in the incompatibility graph G, all bipartitions in I are compatible. Since C(T | X ) ⊆ Triv ∪ (NonTriv ∩ I) = I, π ∈ NonTriv ∩ I must be compatible with C(T | X ), then there is a vertex to split to add π to C(T | X ). By invariant 1 of Lemma 10, v = sv(π) is the vertex to split to add π to T | X and thus Algorithm 2 correctly splits v into v a and v b and connects them to appropriate neighbors such that in T | X , (v a , v b ) induces π.
We abbreviate e 1 (π) and e 2 (π) by e 1 and e 2 . We number the extra subtrees attached to e 1 as t 1 , t 2 , . . . , t p , where p = w(e 1 ) − 1 and t 1 is the closest to A in T 1 . Similarly, we number the extra subtrees attached to e 2 as t 1 , t 2 , . . . , t q , where q = w(e 2 ) − 1 and t 1 is the closest to A in T 2 .
For any set T of trees, let L(T ) denote the union of the leaf set of trees in T . We note that if e i exists, Extra For each k ∈ [w(e 1 )], we define and for each k ∈ [w(e 2 )], we define We know that for each k ∈ [w(e 1 )], Thus, for any k ∈ [w(e 1 )], π k is the bipartition induced by the kth edge on P (e 1 ) in T 1 , where the edges are numbered from the side of A. Therefore, π k ∈ C(T 1 ) for any k ∈ [w(e 1 )]. Similarly, π k ∈ C(T 2 ) for any k ∈ [w(e 2 )].
Since for any k ∈ [w(e 1 )], A k ∩X = A = ∅ and (S 1 \A k )∩X = B = ∅, we have π k | X = π and π k ∈ Π X . Similarly, for each k ∈ [w(e 2 )], π k ∈ Π X and π k | X = π. We also know that since π / ∈ C(T | X ), by Corollary 3, π k / ∈ C(T | S1 ) for any k ∈ [w(e 1 )] and π k / ∈ C(T | S2 ) for any k ∈ [w(e 2 )]. We claim that π k ∈ C(T | S1 ) for all k ∈ [w(e 1 )] and π k ∈ C(T | S2 ) for all k ∈ [w(e 2 )]. Then assuming the claim is true, we have |C(T | S1 ) ∩ C(T 1 ) ∩ Π X | − |C(T | S1 ) ∩ C(T 1 ) ∩ Π X | = w(e 1 ) and , and thus p X (T ) − p X (T ) = w(e 1 ) + w(e 2 ) = w * (π). Now we only need to prove the claim. Fix k ∈ [w(e 1 )], we will show that π k ∈ C(T | S1 ). The claim of π k ∈ C(T | S2 ) for any k ∈ [w(e 2 )] follows by symmetry. By invariant 2 of Lemma 10, we know that all extra subtrees of T R(e 1 ) were attached to v at the beginning of Algorithm 2 and thus the algorithm attaches them all onto (v a , v b ) in the order of t 1 , t 2 , . . . , t p , such that t 1 is closest to A. Let the attaching vertex of t i onto (v a , v b ) be u i for any i ∈ [w(e 1 )]. Then we note P ((v a , v b )) is the path from v a to u 1 , u 2 , . . . , u p and then to v b . For any t ∈ T RS 1 (A), by invariant 3 of Lemma 10, t attached to C(A), the component . Therefore, if we delete any edge of P ((v a , v b )) from T , t is in the same component as A. Similarly, for any t ∈ T RS 1 (B), t is in the same component as B if we delete any edge of P ((v a , v b )) from T . In particular, consider T | S1 − (u k−1 , u k ). The component containing u k−1 and A contains all of T RS 1 (A) and {t i | i ∈ [k − 1]}, thus the leaves of that component is Therefore, the edge (u k−1 , u k ) induces the bipartition [A k |S 1 \A k ] in T | S1 . Hence, π k ∈ C(T | S1 ) as desired.
Proposition 2. Let G be the weighted incompatibility graph on T 1 | X and T 2 | X , and let I be the set of bipartitions associated with vertices in I * , which is a maximum weight independent set of G. Let F be any compatible subset of C(T 1 , T 2 , X). Then w * (I) ≥ w * (F ).
Proof. Let weight(U ) denote the total weight of any set U of vertices of G. We first claim that w * (I) = weight(I * ). Since all bipartitions in C(T 1 | X ) ∩ C(T 2 | X ) are compatible with all bipartitions in C(T 1 , T 2 , X), each of them become two isolated vertices in the weighted incompatibility graph, all of which are included in the maximum weight independent set I * . For each π ∈ C(T 1 | X ) ∩ C(T 2 | X ), the two vertices associated with it in G has total weight w(e 1 (π)) + w(e 2 (π)), which is exactly w * (π). For each π ∈ C(T 1 | X )∆C(T 2 | X ), the vertex associated with it also has weight exactly w * (π). Therefore, the weight(I * ) = w * (I).
Fix any compatible subset F of C(T 1 , T 2 , X). Let F = F \(C(T 1 | X )∩C(T 2 | X )), and let F be a set of vertices constructed by combining the vertices associated with F and the two vertices associated with each of π ∈ C(T 1 | X )∩C(T 2 | X ). Then Since F is compatible, F is also compatible, and thus F is an independent set in G. Therefore, weight(F ) ≤ weight(I * ), since I * is a maximum weight independent set in G. We conclude that w * (F ) ≤ weight(F ) ≤ weight(I * ) = w * (I). for any n ≥ 5. Therefore, T is not an optimal solution for Relax-SFS(A). Since Now consider any tree t = T with leaf set [n], and suppose t contains p bipartitions in Π and q bipartitions in 2 [n] \(Π ∪ Π [n] ) where p, q ∈ N. Since t = T , at least one of p and q is nonzero. Therefore, Since n ≥ 5 and both p and q are non-negative with at least one of them nonzero, we know the RFS score of t is strictly greater than that of T . Therefore, T is an optimal solution to Relax-RFS(A).
For (2), the analysis above shows that T (since it is a compatibility supertree for A) has the largest possible split support score. Hence, T is an optimal solution to the relaxed Split Fit Supertree problem. However, the RFS score for T is (n − 4)(n − 3), which is strictly larger than n − 3 for n > 5, and the RFS score for the star tree T is n − 3; hence, T is not an optimal solution for the relaxed RF supertree problem.
We show that the Split Fit Supertree problem and the Asymmetric Median Supertree (AMS) problem, which was introduced in [46] and which we will present below, have the same set of optimal solutions and thus the hardness of one implies hardness of another. The input to the AMS problem is a profile A = {T i | i ∈ [N ]} and the output is a binary tree Thus, T AMS minimizes the total number of bipartitions that are in the source trees and not in the supertree (i.e., T AM S minimizes the total number of false negatives).
Lemma 13. Given profile A = {T 1 , T 2 , . . . , T N } and S := i∈[N ] L(T i ), tree T ∈ T S is a Split Fit Supertree for A iff T is an Asymmetric Median Supertree for A.

Relationship between SMAST, SMCT, and RFS supertree problems
The SMAST and SMCT problems seek trees that are obtained after deleting minimal numbers of leaves from the input trees so that an agreement supertree or compatible supertree can be constructed from the reduced input trees. Here, we examine the possibility of using these output trees as constraint trees on the search for RFS supertrees, so that the removed taxa could be introduced into the constraint trees. We show that exact solutions to the SMAST and SMCT (Maximum Agreement Supertree and Maximum Compatible Supertree) problems are not directly relevent to solving the Robinson-Foulds supertree problem.

Lemma 14.
There exists a pair of binary trees T 1 and T 2 for which some optimal SMAST or SMCT supertree cannot be extended to any optimal RFS supertree through the insertion of missing taxa.
Note that at least one of A, B, C must be deleted to form an agreement supertree. Suppose C is deleted. Then ((A, z), ((B, x), (y, (D, E)))) is an optimal SMAST.
Observe that any way of adding C into this tree produces a supertree that has total RFS score greater than 2. Hence, for this pair T 1 and T 2 of input trees, for at least one optimal SMAST supertree, there is no way to extend that optimal supertree into an optimal RFS supertree.

Maximum Weight Independent Set in Bipartite Graphs
Given an undirected bipartite graph G, with vertices V = A ∪ B, edges E, and vertex weights w : V → N, the Maximum Weighted Independent Set problem tries to find a independent set I ⊆ V that maximizes w(I), where w(S) = v∈S w(v) for any S ⊆ V . It is well known (in folklore) that maximum weight independent set can be solved in polynomial time through reduction to the maximum flow problem. We reproduce a proof for completeness.
We first turn the graph into a directed flow network G = (V ∪ {s, t}, E ) where s and t are the newly added source and sink, respectively. To obtain E , we direct all edges in E from A to B, add an edge from s to each vertex u ∈ A and add an edge from each vertex v ∈ B to t. We set the capacities c : is an independent set, there is no edge from S ∩ A to T ∩ B. There is also no edge from S ∩ B to T ∩ A since edges in E are directed from A to B. Thus, the cut (S, T ) consists of only edges from s to T ∩ A and from S ∩ B to t. Together the capacities of those edges equal the weight of the set (S ∩ B) ∪ (T ∩ A), which is k.
For the other direction of the proof, suppose (S, T ) is an s, t-cut of finite capacity k. Since the cut has finite capacity, it does not contain any edge derived from E. In particular, there is no edge from S ∩ A to T ∩ B in G , which implies there is no edge between S ∩ A and T ∩ B in G. Since there is also no edge among S ∩ A and T ∩ B in G, (S ∩ A) ∪ (T ∩ B) is an independent set. Since the edges in (S, T ) solely consist of edges from s to T ∩ A and from S ∩ B to t, the sum of their capacities is k. Therefore, the weight of the set (S ∩ B) ∪ (T ∩ A) is k and the weight of (S ∩ A) ∪ (T ∩ B) is w(V ) − k.
Since w(V ) is a fixed constant, we conclude that any s, t-cut (S, T ) is a minimum cut in G if and only if (S ∩ A) ∪ (T ∩ B) is an maximum weight independent set in G. By the standard Max-flow Min-cut theorem, a minimum s, t-cut in a directed graph is equivalent to the maximum s, t-flow. Thus, we can solve the Maximum Weighted Independent Set problem on bipartite graphs using a maximum flow algorithm in polynomial time.

Experimental study
We present additional details about the experimental performance study.

Datasets
Experiment 1 To produce the datasets for the experiment on multi-locus datasets, we used SimPhy [20] to generate species trees and gene trees with 501 species under the multi-species coalescent model, producing a set of true gene trees that differ from the true species tree by 68% of their branches on average due to ILS [19]. The number of genes varied from 25 to 1000 with ten replicate datasets per number of genes.
For each replicate dataset, we used the model species tree and a technique similar to DACTAL [26] (described below) to divide the species set into two overlapping subsets, each containing slightly more than half the species. AS-TRAL [23, 24, 51] is a leading method for species tree estimation in the presence of ILS for large numbers of species, and so we used ASTRAL v5.6.3 (i.e., ASTRAL-III) to construct subset trees on the model gene trees, restricted to the relevant subset of species. Finally, the two ASTRAL subset trees were merged together using Exact-2-RFS and FastRFS. The following steps describe the procedure in details.
1. Identify a centroid edge (a, b) in the true species tree (i.e., an edge that, upon deletion, creates two subtrees T a and T b with leaf sets A and B of roughly equal size) 2. Let X be the set of 25 closest (in term of path distance on the weighted tree) leaves in T a to a and in T b to b 3. Let A = A ∪ X and B = B ∪ X 4. Restrict all 1000 true gene trees to leaf set A and use ASTRAL-III [51] to compute a tree A1 on the restricted true gene trees 5. Restrict all 1000 true gene trees to leaf set B and use ASTRAL-III to compute a tree B1 on the restricted true gene trees 6. Apply supertree methods FastRFS and Exact-2-RFS to input pair A1 and B1, and compare to the true species tree We vary the number of true gene trees by selecting the first 100 and 25 true gene trees from the datasets with 1000 true gene trees. Experiment 2 Each source tree is computed using maximum likelihood heuristics, with several clade-based source trees and a single scaffold source tree (i.e., species sampled randomly from across the tree). We selected the hardest of these 500leaf conditions, where the scaffold tree has only 20% of the leaves. Because all the source trees miss some leaves, the number of leaves per supertree dataset varied. The source trees were then given to FastRFS and GreedyRFS to combine into a supertree.
We use the first 10 replicates out of a total of 30 replicates. Note that since replicate number 8 requires combining two trees with less than 2 shared taxa, supertree construction does not make sense on this replicate. After eliminating this replicate, we end up with 9 replicates in total. To make inputs with k source trees, for k ∈ {2, 4, 6, 8, 10, 12, 14}, we take the first k source trees in each replicate. Since the first tree is always the scaffold tree, all of our replicates contain the scaffold tree. The average number of leaves per each source tree per replicate for these datasets is (rounded to the nearest integer) {87, 79, 78, 73, 74, 73, 73}, for each corresponding k.

Scripts and Commands
Our scripts and other utilities (developed by the authors of this paper) are available at http://github.com/yuxilin51/GreedyRFS.
-GreedyRFS on a set of source trees
-RFS criterion score To compute the RFS criterion score for a supertree T with respect to a profile A, we add the RF distances between T and every tree t ∈ A, as follows: compare_trees.py <tree1> <tree2>