Average-Tree Phylogenetic Diversity of Networks
Abstract
Phylogenetic diversity is a measure used to quantify the biodiversity of a set of species. Here, we introduce the “average-tree” phylogenetic diversity score in rooted binary phylogenetic networks and consider algorithms for computing and maximizing the score on a given network. Basically, the score is the weighted average of the phylogenetic diversity scores in all trees displayed by the network, with the weights determined by the inheritance probabilities on the reticulation edges used in the embeddings. We show that computing the score of a given set of taxa in a given network is #P-hard, directly implying #P-hardness of finding a subset of taxa achieving maximum diversity score and, thereby, ruling out polynomial-time algorithms for these problems unless the polynomial hierarchy collapses. However, we show that both problems can be solved efficiently if the input network is close to being a tree in the sense that its reticulation number is small. More precisely, we prove that we can solve the optimization problem in networks with leaves and reticulations in time. Using experiments on data produced by simulating a reticulate-evolution process, we show that our algorithm runs efficiently on networks with hundreds of taxa and tens of reticulations.
Keywords and phrases:
phylogenetic diversity, phylogenetic networks, network phylogenetic diversity, algorithms, computational complexityFunding:
Leo van Iersel: Partially funded by the Dutch Organisation for Scientific Research (NWO) grant OCENW.KLEIN.125 and OCENW.M.21.306.Copyright and License:
2012 ACM Subject Classification:
Applied computing Computational biology ; Theory of computation Fixed parameter tractability ; Theory of computation Problems, reductions and completenessEditors:
Broňa Brejová and Rob PatroSeries and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Human-driven habitat degradation and environmental change have led to accelerated rates of species extinction, posing a significant threat to global biodiversity [18]. In response, conservation efforts face both logistical constraints – such as limited financial resources – and conceptual challenges, notably in identifying effective prioritization strategies. This has motivated the development of formal criteria and quantitative indicators to guide decision-making and assess biodiversity in a systematic and reproducible manner. Here, we study formal measures for quantifying the biodiversity of a given set of species and algorithms for computing and maximizing these measures.
Phylogenetic Diversity (PD) is a well-studied measure used to quantify the biodiversity of a set of species using their evolutionary relationships [22]. When these relationships are described by a rooted phylogenetic tree (with edge lengths) on a set of species, the phylogenetic diversity of a subset of the species is simply the total length of all edges that lie on at least one path from the root of the tree to a leaf in [8]. The main intuition behind this is that species that are evolutionarily further apart together contribute more diversity than species that are evolutionarily closely related. Phylogenetic diversity can more formally be interpreted as follows in terms of features (e.g., genes or morphological features). Suppose that (1) the number of features introduced on each edge of the tree is proportional to the length of that edge, (2) features are never lost, and (3) no feature is introduced independently on multiple branches. Then, the number of features in is proportional to its phylogenetic diversity. While the assumptions on the model may seem somewhat unrealistic, it has been used prominently in the literature [31, 3, 5, 24].
In many cases, however, describing evolutionary relationships by trees is too simplistic [7]. Rooted phylogenetic networks generalize rooted phylogenetic trees and are more suited to represent complex evolutionary scenarios involving, for example, hybridization events [12, 1, 26]. Roughly speaking, a rooted phylogenetic network is a directed acyclic graph with a single root and labeled leaves. Vertices with two (or more, in the nonbinary case) incoming arcs are called reticulations and can be used to model hybridization events, lateral gene transfer, or other types of horizontal evolutionary events. Generalizing the notion of phylogenetic diversity to networks is not obvious, and multiple variants have recently been proposed for rooted [32, 3, 5] and unrooted [20, 2, 4] networks.
The currently best-studied variants are AllPathsPD and NetworkPD [3, 14, 27, 5, 24]. Both definitions generalize PD on trees, in the following ways. is the total length of all edges that lie on at least one path from the root of the network to a leaf in . NetworkPD generalizes AllPathsPD by allowing arcs to have inheritance probabilities, representing the probability that a given feature in the parent is inherited by the child. The idea of is, roughly speaking, to compute the expected number of distinct features in by using the inheritance probabilities to compute, for each arc , the probability that at least one copy of a feature introduced on the arc is inherited by a leaf in . More precisely, is defined as times the inheritence probability of if is a reticulation with child , as if is a tree node with children , and as or if is a leaf, depending on whether or not .
Note that, in the NetworkPD setting, features are inherited independently from different parents. For example, in the network in Figure 1, a feature introduced on the arc is present in all parents of the reticulation above species but it has only a probability of being inherited by . In addition, in NetworkPD it is possible to inherit the same feature from both parents, resulting in the presence of multiple copies of that feature. While this can be realistic in certain scenarios (e.g. allopolyploidy), in other cases it is more realistic to assume that each feature is inherited from at most one parent.
A less studied alternative averages the phylogenetic diversity score over all trees displayed by the network [31]. Each tree has an assigned probability equal to the product of the inheritance probabilities of the reticulation edges used by its embedding, and the PD-score of each tree is weighted by this probability. Thus, the resulting measure can be interpreted as the expected diversity of any tree displayed by the network. In this model, features inherited from different edges incoming to the same reticulation vertex are not considered to be independent. The assumption is basically that each feature is inherited from exactly one parent (or introduced on that edge). Note however that Wicke and Fischer [31] use an uncommon definition of displayed trees, in which trees are only considered to be displayed if they contain all vertices of the network and all their leaves are in . In particular, a network that is not tree-based [13] would have no displayed trees and an undefined phylogenetic diversity score. Since we see no reason to ignore certain displayed trees, we will take the (weighted) average over all displayed trees in this paper. We call the resulting phylogenetic diversity measure on phylogenetic networks the average-tree phylogenetic diversity, APD for short, see Figure 1 for an example.
While APD is a natural and biologically relevant generalization of phylogenetic diversity from trees to networks, we show that it comes with great challenges. Although for most PD measures on networks, such as AllPathsPD and NetworkPD, the diversity score of a given set of species can be easily computed in polynomial time, we will show in this paper that for APD this is already computationally hard (more precisely, #P-hard; a polynomial-time algorithm for it would imply a collapse of the polynomial hierarchy.) Consequently, the maximization problem associated to APD is also computationally hard. This problem, called Max-APD, aims to find a set of species with maximum APD score.
On the positive side, we show that, although both these problems are #P-hard, they can both be solved exactly and rather efficiently in practice. We do this by presenting a parameterized algorithm finding a size- set of leaves that maximizes the APD score in binary networks with reticulations and leaves in time. This algorithm can easily be adapted to compute the score of a given set . The running time of our algorithm scales linearly with the number of species but exponentially with the number of reticulations in the network. Using practical experiments on simulated data we show that the algorithm can solve instances with up to species and reticulations within minutes on 16 GB RAM and 8 CPUs computer, with . In addition, the algorithm efficiently solves instances with up to at least if the number of reticulations is fixed as .
1.1 Related Literature
There is a rich body of literature on phylogenetic diversity on (rooted and unrooted) phylogenetic trees. Most relevant to our work, the optimization problem of finding species with maximum phylogenetic diversity can be solved in polynomial time on (rooted and unrooted) trees [21, 25]. Further research found fast algorithms on trees, where biological dependencies were considered also [19, 23, 16]. However, only a few papers have studied phylogenetic diversity on phylogenetic networks [32, 3, 15, 27].
Wicke and Fischer [31] introduced several generalizations of phylogenetic diversity of rooted trees to rooted networks, including AllPathsPD and a variant of APD (as discussed above). The question of efficient computability is not addressed, nor are the corresponding maximization problems. Instead, the authors define several biodiversity indices on phylogenetic networks, which can be used to rank species for conservation. The associated maximization problems are first addressed by Bordewich et al. [3], showing that the decision versions of the maximization problems of AllPathsPD and NetworkPD are NP-hard, even for tree-child networks but, on level-1 networks,111The level of a network is the maximum reticulation number of any biconnected component. It may be arbitrarily smaller than the reticulation number of the network. a solution maximizing the AllPathsPD score can be found in polynomial time. They also introduce two more variants, MaxWeightTree-PD and MinWeightTree-PD. In these, the of taxa is the maximum and minimum weight of the phylogenetic diversity of in any display tree. While for the former, the maximization problem is polynomial-time solvable, for the latter, computing the score of a given subset of the species is already NP-hard.
The AllPathsPD problem was shown to also be susceptible to greedy approaches, as long as the input network is semi-binary (maximum in-degree two) and has level111The level of a network is the maximum reticulation number of any biconnected component. It may be arbitrarily smaller than the reticulation number of the network. at most two [5].
Jones and Schestag [14] studied the maximization problem for AllPathsPD in more detail. They showed that it is W[2]-hard when the parameter is the number of species to save, but fixed-parameter tractable (FPT) when the parameter is the optimal phylogenetic diversity, the acceptable loss of phylogenetic diversity, the number of reticulations in the network, or the treewidth of the underlying graph. Recently, this measure has been considered in semidirected networks [11].
2 Preliminaries
Phylogenetic Networks.
In this paper, we consider binary, rooted phylogenetic networks with integer weights on arcs, in which each reticulation has an associated probability distribution on its incoming arcs. Formally, for a set of taxa, a binary, rooted phylogenetic network on , later only called phylogenetic network, is a tuple satisfying the following conditions:
-
forms a directed acyclic graph with vertices and arcs , in which is the set of leaves (vertices with in-degree and out-degree ), and for which
-
–
There is a single vertex of in-degree and out-degree , called the root ;
-
–
Vertices of in-degree and out-degree are called the reticulations and denoted ;
-
–
The remaining vertices have in-degree and out-degree and are called the tree nodes.
An arc is a reticulation arc if and is the set of reticulation arcs.
-
–
-
is a bijective mapping of leaves of to taxa in and is called labelling of . We refer to a leaf of interchangeably with the taxa it is labeled with.
-
assigns each arc a non-negative integer, called the (arc-)weight of .
-
is a function that assigns each reticulation arc a real value between and , subject to the constraint that for any reticulation with incoming arcs and . We call the inheritance probability of . Informally, represents the probability that a randomly-selected gene in is inherited from .222We explicitly point out the difference to NetworkPD, where represents the proportion of genes in that are inherited by [3, 27]. Here, is the probability that a randomly chosen switching has . Consequently, gives the probability that anything is inherited along .
We often refer to the graph as and we refer to as a vertex of and as an arc in . For such an arc , we call a parent of , and a child of . For a vertex of a phylogenetic network, the offspring of is the set of leaves reachable from with a direct path. We will permit ourselves to write for and for . A phylogenetic tree is a phylogenetic network with no reticulations.
Phylogenetic Diversity.
For a phylogenetic tree on , the phylogenetic diversity of a set is the total weight of arcs that are on a path from the root to a leaf in [8]. In this section, we generalize this measure to phylogenetic networks. For an example consider Figure 2.
A switching is a function that maps each reticulation to one of its incoming arcs. That is, a switching corresponds to a decision about which incoming arc each reticulation will inherit from. The set of switchings in is denoted by . The probability of a switching is . That is, we assume that for each reticulation, one of its incoming arcs is chosen (with probability given by ) independently at random.
The switching-tree for a switching is the subgraph of derived by deleting all reticulation arcs except for those in . Observe that is a directed tree, possibly with some “dead” (unlabelled) leaves and vertices of degree . Note that is a subset of the leaves of , for each switching . Given a switching and a set of taxa , we define as under the usual definition of phylogenetic diversity on trees.
Now, we can define the measure we study in this paper. For a phylogenetic network and a set of taxa, the average phylogenetic diversity of with respect to is
Informally, is the amount of phylogenetic diversity that we expect to preserve by saving , under the assumption that each reticulation independently chooses one of its parents to inherit from. Note that this is equivalent to the definition given in the introduction based on displayed trees since suppressing degree- vertices and deleting unlabelled leaves does not affect the phylogenetic diversity score of a tree. The associated decision problem is as follows.
Max-APD Input: A phylogenetic network and integers and . Question: Is there a set of taxa which has a size of and with ?
3 #P-Hardness of computing APDN when saving all taxa
In the following, we prove that computing is #P-hard, even if all taxa are being saved, in contrast to other phylogenetic diversity measures on networks like NetworkPD [3, 27] or AllPathsPD [3, 14, 16], that can be computed (not optimized) in polynomial time.
Theorem 3.1.
Computing for a phylogenetic network on is #P-hard.
We prove the theorem by reduction from Counting Perfect Matchings. A perfect matching of an undirected graph is a set of edges such that each vertex of is incident with exactly one edge of . Counting the number of perfect matchings in a cubic graph is #P-hard [6, Theorem 6.2].
We first observe that the number of perfect matchings in a graph is the same as the number of mappings that are perfect assignments, as defined next. A mapping is an assignment if is an incident edge of for all . An assignment is perfect if . As for each edge , an assignment is perfect if and only if for each . That is, for each , a perfect assignment assigns either both or none of the incident vertices to . Thus, mapping each perfect assignment to constitutes an injection from perfect assignments to perfect matchings in . On the other hand, if is a perfect matching in , then a perfect assignment can be defined from by setting to be the unique edge in incident to for all . It is easy to see that this mapping is an injection from the set of all perfect matchings in to the set of perfect assignments. Thus, the perfect matchings in are in bijection with its perfect assignments, so their number is the same. In the following, we let denote this number.
Given an arbitrary cubic graph , we construct a network on a set of taxa in such a way that can be determined from . See Figure 3 for an example.
Let , that is, has one element for each vertex in . Now, for each , add two vertices and , where is a parent of , and is a parent of . For each , add a vertex to . For any vertex , choose one of its three incident edges arbitrarily (recall that is cubic), and add an arc . Then, for the remaining two edges and incident with , add the arcs and . Observe that, so far, has out-degree and in-degree for each , and and are reticulations for all . Furthermore, has a directed path to in if and only if is incident to in .
Create two new special vertices and and add an arc (this will be the only arc that gets a large weight). Add a root as the parent of . Now, for each , add a new vertex as the parent of . For each vertex , add two parents and . Add arcs such that and form paths. We now have that each vertex is a reticulation, and exactly one of the incoming arcs of is on a directed path from the root that passes through . In Figure 4 an illustration is given.
We define the arc weights by setting for a large value to be defined later, and setting for all other arcs .
It remains to define the inheritance probabilities on the reticulation arcs.
First, consider for some . Then, set and , where is a small constant to be defined later. Note that, when we choose with probability then, for each , the probability that has a path to some is exactly .
Now, consider and for some . Then, set , , and set , where is the other arc incoming at and and are the two arcs incoming at . The idea here is that when we choose a switching and corresponding tree with probability , each vertex for adjacent to has probability of having a path to in . As exactly one such will have a path to , we can think of the choice of as corresponding to a choice of assignment (where is mapped to in the assignment if has a path to in ). Our choice of inheritance probabilities ensures that each of the three edges incident to is chosen with probability , independently for each , and so each possible assignment is chosen with equal probability. This completes the reduction.
Given a switching , let if has a path to some leaf in , and otherwise. Thus, the expected contribution of to is . We first show that this value depends entirely on the possible assignments of .
Lemma 3.2.
.
Proof.
We first fix some notation. A partial switching on is the restriction of a switching to the subset of reticulations. Let denote the set of all partial switchings on . For a partial switching , let .
Let be the set of reticulations , and let be the set of reticulations . If is a switching and and are the restrictions of to and respectively, then we write . Observe that in this case and that and therefore .
For a partial switching on , we can associate an assignment on , as follows. For any , let , , and be the incident edges of in , and without loss of generality assume that is a parent of and the parents of are and . We say a partial switching and assignment agree on if one of the following holds: 1. and ; 2. and , ; or 3. and , . That is, is the unique edge incident to for which the switching chooses a set of reticulation arcs that connect to . See Figure 5. An assignment and a partial switching are associated if and agree on all .
For assignments on , let denote the set of all partial switchings associated with . We use these sets to decompose the sum , as follows:
The next two claims allow us to express the above in terms of for all assignments .
Claim 3.3.
, for each assignment on and .
Proof.
Let for some and recall that if and only if has a path to some leaf in . Any such path must pass through some , and only has a path in to some if is assigned to by i.e. . Thus, if and only if has a path in to for some . This occurs if and only if for some . Or, equivalently, if and only if for all . It follows
| as, roughly speaking, the sum of the probabilities of partial switchings that choose the -arcs equals the probability that a random switching makes all these choices, this equals | ||||
Claim 3.4.
For any assignment on , it holds that .
Proof.
Recall that, for any and any assignment , we have if and only if agrees with on all . Recall also that for any , whether and agree on depends only on , and . So suppose we choose a partial switching with probability . Note that the probability that is .
Now consider the probability that agrees with on some . If for the edge such that , then agrees with on if and only if , and this occurs with probability . Now, if where is one of the parents of , then agrees with on if and only if and , and this occurs with probability . In either case, and agree on with probability . As these agreements are independent for each , the probability that agrees with on all is . But this is exactly the probability that , which is . By ˜3.3 and ˜3.4, we have
Lemma 3.2 gives us an expression for the expected contribution , in terms of the values for all assignments on . We next show that this value can be restricted to a certain range depending on on the number of perfect assignments. For this next step, we now fix the value of such that . Let .
Lemma 3.5.
.
Proof.
From Lemma 3.2 we have that . As there are exactly assignments on , this is equivalent to
Recall that for each assignment , with equality if and only if is a perfect assignment. Then, counting only the perfect assignments,
proving the first inequality.
For the second inequality, observe that for any imperfect assignment . (As is cubic, is even). Then, counting perfect and imperfect assignments separately, and using the fact that there are at most imperfect assignments, we have
By what we have seen so far, the expected contribution of arc to is between and , where is the number of perfect assignments (and therefore the number of perfect matchings) on . It remains to show that this term dominates the other terms in , such that is also bounded by a range dependent on . For this, we fix such that . That is, the weight of is more than times the total weight of all the other arcs in .
Lemma 3.6.
We have
or equivalently, .
Proof.
First, consider the tree associated with an arbitrary switching . If has a path to a leaf in (i.e. if ) then has an offspring in , and so . Thus . As the weight of all other arcs is , we also have that . Then, by Lemma 3.5,
| and | ||||
Given Lemma 3.6, we can count the number of perfect matchings in a cubic graph as follows. Construct the network on as described above in polynomial time, then compute . Determine the unique integer for which we have . By Lemma 3.6, is the number of perfect matchings in .
As counting perfect matchings is #P-hard, it follows that computing is also #P-hard. This completes the proof of Theorem 3.1.
4 Maximizing Average-Tree Phylogenetic Diversity
In this section, we show that Max-APD is fixed-parameter tractable (FPT) with respect to the number of reticulations. For the sake of brevity, we restrict our theoretical proof to simple networks – that is, has no cut-arcs except for those incident to degree-1 vertices. However, we see no reason why this result could not be extended to non-simple networks. Indeed, our implementation, see Section 5, can deal with arbitrary networks.
We make use of the notion of generators, as introduced in [28]. This gives us a certain useful partition of the leaves of a network into sides. At the high level, our algorithm “guesses” the optimal solution by guessing which sides of the network contain a leaf from . As the number of sides is , there are at most such guesses to consider. Once it has been decided which sides contain a leaf from the solution, the problem reduces to a problem on forests, which can be solved in polynomial time.
In what follows, let be a network and let , and . That is, , , and denote the number of reticulations, vertices, and arcs, respectively.
4.1 Characterizing solutions in terms of generator sides
Definition 4.1.
Let be a simple network. The generator of is the directed multigraph obtained from by (1) deleting nodes with in-degree one and out-degree zero (leaves) and (2) suppressing all nodes with in- and out-degree one. The arcs and in-degree-2 out-degree-0 vertices of are called sides. The arcs are also called arc sides, and the in-degree-2 out-degree-0 vertices are also reticulation sides. We say that a leaf is on side (or that side contains ) if either
-
is a reticulation side of and the parent of in , or
-
is an arc side of , obtained by suppressing in-degree-1 out-degree-1 vertices of a path in and the parent of in lies on .
In addition, if is an arc side of , obtained by suppressing in-degree-1 out-degree-1 vertices of a path in , then we say that every arc in is a path-arc of side . We call the lowest arc in the lowest path-arc of . Observe that every arc in is either a leaf-arc, or a path-arc of some side.
For an arc side , let denote the subgraph of whose vertices are the leaves on side together with the vertices incident to any path-arcs of , and whose arcs are the leaf-arcs incident to the leaves of together with the path-arcs of . Note that is a tree (though not necessarily a phylogenetic tree). We call the side-tree of .
The generator of a simple binary network with reticulations has at most reticulation sides and at most arc sides [9].
In order to find solutions for Max-APD on simple networks, we characterize possible solutions in terms of their relation to the generator of .
Definition 4.2.
For a simple phylogenetic network on , let denote the sides of the generator of . For , the signature of is the set of all sides that contain at least one leaf of .
Now, to find a solution for Max-APD, it is enough to compute, for each signature , the maximum value of over all sets with signature and .
4.2 Contribution of lowest path-arcs
For an arc in , a set of taxa and a switching , define to be if has an offspring from in , and otherwise (if is not an arc in then define to be ). Then define
Intuitively, corresponds to the expected probability that has an offspring in under a randomly selected switching. In the next lemma, we show that gives the contribution of arc to .
Lemma 4.3.
.
Proof.
By definition, . Thus,
Thus, to compute , it is enough to compute for each arc . The next lemma shows that for certain arcs, the value of depends only on the signature of . This will be important for the construction that follows.
Lemma 4.4.
Let be the lowest path-arc of a path side and let be two subsets of with the same signature. Then .
Proof.
By the definition of , it suffices to show that for each switching . So suppose that has an offspring in such that . Let be the side containing . As and have the same signature, also contains some leaf . Since and are on the same side and is an offspring of the arc in , also is an offspring of in . Thus is an offspring of in , and so . By a symmetric argument, from we conclude . Thus for all and we can conclude that .
For a lowest path-arc and signature , define to be for any with with signature . By Lemma 4.4, this is well-defined.
Lemma 4.5.
For any lowest path-arc and any , the value of can be computed in time. Consequently, can be computed in time for each lowest path-arc and signature .
Proof.
Observe that as each reticulation has two incoming arcs and a switching maps each reticulation to one of its incoming arcs, there are exactly switchings. For each switching , the value of for all that are lowest arcs of a generator arc side, can be computed in time by a bottom-up traversal in the generator of , registering whether each such has an offspring from in . Adding up all values of (to compute ) can therefore be done in time.
To compute , it is sufficient to construct a set with signature by picking an arbitrary leaf from each side . As , we can compute this value in time.
4.3 Reduction to Forests
Let be an instance of Max-APD. Fix a signature and assume there exists a solution with We now show how to reduce Max-APD under this assumption to a related problem on forests.
Let be non-empty and pairwise disjoint subsets of , and let be a forest of -trees . Define to be the union . For technical reasons, we allow to have non-integer arc weights and degree-2 vertices. A set respects if contains a taxon of each for . For a set , define . For a forest and two integers , we define a decision problem Max-PD in which we are asked whether a set exists such that has a size of , , and respects . This definition generalizes the classic definition of Max-PD on trees [8].
Construction 1.
Initialize , , and let be an empty set. Then for each side in do the following:
-
If is a reticulation side, then let be the single leaf on this side and let be its incoming arc. If , then increase by , and add to , where is a tree with a single leaf and single arc of weight .
-
If is an arc side, then let denote the lowest path-arc of . For each path-arc of (including ), add to . Then if , add to , where is derived from the side-tree as follows: Multiply the weight of each path-arc by , then delete the lowest path-arc .
Lemma 4.6.
Let be a subset of sides and let be derived from according to Construction 1. Then for , the signature of is if and only if respects . Moreover, if , then .
Proof.
Observe that by construction, each tree in corresponds to a side , and as such respects if and only if .
So now assume that . We prove that .
Note that every arc in has a corresponding arc in . For ease of notation, we speak of the same arc as existing in and . Let denote the weight of an arc in , and let denote the weight of in . For any arc that is not a lowest path-arc in , let if has an offspring from in , and let otherwise. Observe that .
Recall that denotes the probability that, taken some switching , the arc has an offspring from in . By Lemma 4.3 we know .
Claim 4.7.
For an arc side with lowest path-arc , and a path-arc of that is not , if then and, otherwise, .
Claim 4.8.
For a leaf-arc with on side , it holds that if , and , otherwise.
Recall that, for any lowest path-arc on side , it holds that by definition. Let denote the arc sides on and let denote the reticulation sides. Note that . By construction it is .
To prove the main claim, let denote the leaf-arcs of , let denote the path-arcs which are not the lowest path-arc on their side, and let denote the arcs which are the lowest path-arc of their side. For each , let denote the arcs in belonging to a side in , and let . Note that the arc set of is . For each arc , let denote the lowest path-arc on the side containing (note that if then ). Due to space constraints, is proven in the appendix.
Finally, we show that Max-PD can be solved efficiently on forests. For this, we use Faith’s famous greedy algorithm on trees [8, 21, 25].
Lemma 4.9.
Max-PD can be solved in time on forests.
Due to space constraints, Lemma 4.9 is proven in the appendix.
4.4 Putting everything together
For any signature , let denote the pair derived for according to Construction 1. Lemma 4.6 implies that , for any with signature . As every has a signature, it follows that is a Yes-instance of Max-APD if and only if is a Yes-instance of Max-PD for some . We can therefore solve Max-APD on using the following algorithm: For each in turn, construct and solve Max-PD on . If the answer is Yes for any Max-PD instance, return Yes. Otherwise, return No.
As the generator of has at most sides [9], there are at most signatures to consider. For each signature and each lowest path-arc , the value of can be computed in time by Lemma 4.5. After these values are computed, can be constructed in time. Finally, solving Max-PD on can be done in time by Lemma 4.9. Thus, the total running time of the algorithm is in .
5 Experiments
5.1 Data
Phylogenetic networks have been simulated333All data is publicly available at https://sdrive.cnrs.fr/s/aEFNN3iRp7goQBT. via the package SiPhyNetwork [17]. This package allows simulations of phylogenetic networks as well as traits developing along the networks, giving several customization options. In the following, we describe the choices made for our simulations, and we refer to the SiPhyNetwork paper for more details [17]. We simulated phylogenetic networks with 1, 20, …, 100, 150, … 500 leaves using the generalized sampling approach [10] implemented in SiPhyNetwork, with 100 simulation replicates, speciation rate equal to 1, extinction rate equal to 0.6, and hybridization rate varying from 0.1 to 0.001 for getting a reasonable reticulation number, even for large instances (see distributions of instances in Figure 6).
In our simulations, we aimed at simulating hybridizations between plants and, for doing so, we opted for the “lineage generative hybridizations” implemented in SiPhyNetwork, with inheritance probability fixed to and a trait model where hybridization occurs only among species with equal ploidy. The rate of autopolyploidy has been fixed to 0.05. In this simulation setting, the number of leaves and the number of reticulations are correlated, so it is unlikely to produce large networks with small reticulation number or small networks with large reticulation number.
5.2 Implementation Details
The algorithm presented in Section 4 was implemented in C++ in a serial manner (no multithreading) in the framework of the phylo-tools library [30] (both published under the open source CeCILL Free Software License 2.1) with a few heuristic improvements that we discuss in the following.
Guesses on the Generator.
Instead of “guessing” which of the sides of the generator have tree-paths to selected leaves as stated in Construction 1, it is sufficient to guess which nodes of the generator have tree-paths to selected leaves. The nodes are pre-filtered, so the guess is done only on nodes that actually have tree-paths to leaves in the input network. Then, all subsets of size at most of such nodes are enumerated. Note that those are polynomially many if is constant (with the degree depending on ).
Enumerating Switchings.
Given the information which generator nodes have tree-paths to selected leaves, we can group the possible switchings in order to avoid enumerating all of them. Indeed, if contains no leaf below some reticulation , then the switching of its reticulation arcs has no bearing on the value of for any . This allows us to restrict enumeration to switchings involving reticulations above leaves in . If the input network is mostly “flat”, that is, most paths contain only constantly many reticulations, then we expect the enumeration-step to take , rather than steps. For fixed , we would therefore expect polynomial running time (with the degree depending on ).
Keeping Multiple Solutions.
In order to assess the quality of our diversity measure and compare it to other phylogenetic diversity measures, as well as the actual feature diversity, it would be helpful to output not only one, but all best-scoring solutions, and maybe even some sub-optimal solutions. To accommodate this, we store solutions in a sorted array that retains a given number of solutions and automatically discards solutions with a lower score.
Implementation of Greedy Selection.
Once the values are computed for all , we set up a structure mapping each node of the network to a list of its children sorted by diversity attainable by saving a leaf with a tree-path from that avoids generator nodes. Then, the best entry is selected, the pointers are followed down to the best leaf , and the structure is updated from upwards until a generator node is encountered. While we incur a factor of (where is the maximum out-degree in the network), this degenerates to since our input networks are binary. Note that we avoid explicitly constructing the forest on which we run the greedy algorithm.
5.3 Experimental Results
While the theoretical running time of may not seem convincing, the implementation performs satisfactorily in practice, possibly due to the heuristic improvements discussed in Section 5.2.
Reticulation number .
Figure 7 (left) shows that the observed dependence on the reticulation number for behaves very similarly to the reference function . For , this seems to confirm our hypothesis that the running time can be bounded in on most networks. While the real-time setting (up to 17ms per run) ends rather early at , we can cover up to in under 5 minutes, even for networks with many leaves ().
Solution size .
Figure 7 (right) shows a clear partition of growth scenarios for the running time dependent on with and fixed: For , the running time grows exponentially, roughly proportional to but for , its growth behaves as a linear function in . Again, this is characteristic for functions in .
Each of the instances with , finished within 8 seconds for each .
6 Discussion
Multiple formulations of the concept of phylogenetic diversity on phylogenetic networks have been proposed recently, and it is not yet clear which variants are most relevant in practice, for example for conservation biology. In our opinion, the average-tree phylogenetic diversity score (APD) appears to be a very natural and biologically relevant generalization of phylogenetic diversity on trees. In addition, in the small toy example in Figure 1, APD seems to make a more reasonable choice than NetworkPD by selecting the hybrid species. However, to properly compare the different variants of phylogenetic diversity and to compare them to the feature diversity [32], extensive simulations and multiple case studies are needed. We believe that this is a very important topic for further research.
From a computational point of view, an interesting open problem that remains is whether our FPT algorithm for the maximization version of APD can be improved to a parameterized algorithm with respect to the level of the network. This would be interesting since the level can be much smaller than the current parameter, the number of reticulations. We conjecture that, in contrast to NetworkPD (which is NP-hard for level-1 networks [27]), this is possible for APD. Another interesting direction would be to extend the algorithm to nonbinary networks.
Finally, while the theoretical running time of our algorithm is , experiments suggest that the practical running time behaves like on networks generated by evolutionary processes. It could be interesting to study this further.
References
- [1] Eric Bapteste, Leo van Iersel, Axel Janke, Scot Kelchner, Steven Kelk, James O. McInerney, David A. Morrison, Luay Nakhleh, Mike Steel, Leen Stougie, and James Whitfield. Networks: expanding evolutionary thinking. Trends in Genetics, 29(8):439–441, 2013.
- [2] Magnus Bordewich, Charles Semple, and Andreas Spillner. Optimizing phylogenetic diversity across two trees. Applied Mathematics Letters, 22(5):638–641, 2009. doi:10.1016/j.aml.2008.05.004.
- [3] Magnus Bordewich, Charles Semple, and Kristina Wicke. On the complexity of optimising variants of phylogenetic diversity on phylogenetic networks. Theoretical Computer Science, 917:66–80, 2022. doi:10.1016/J.TCS.2022.03.012.
- [4] Olga Chernomor, Steffen Klaere, Arndt von Haeseler, and Bui Quang Minh. Split Diversity: Measuring and Optimizing Biodiversity Using Phylogenetic Split Networks, volume 14, pages 173–195. Springer Cham, 2016.
- [5] Tomás M. Coronado, Gabriel Riera, and Francesc Rosselló. An interchange property for the rooted phylogenetic subnet diversity on phylogenetic networks. Journal of Mathematical Biology, 89(5):48, 2024.
- [6] Paul Dagum and Michael Luby. Approximating the permanent of graphs with large factors. Theoretical Computer Science, 102(2):283–305, 1992. doi:10.1016/0304-3975(92)90234-7.
- [7] W. Ford Doolittle. Phylogenetic Classification and the Universal Tree. Science, 284(5423):2124–2128, 1999. doi:10.1126/science.284.5423.2124.
- [8] Daniel P. Faith. Conservation evaluation and phylogenetic diversity. Biological Conservation, 61(1):1–10, 1992.
- [9] Philippe Gambette, Vincent Berry, and Christophe Paul. The Structure of Level-k Phylogenetic Networks. In Proceedings of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM 2009), pages 289–300. Springer, 2009. doi:10.1007/978-3-642-02441-2_26.
- [10] Klaas Hartmann, Dennis Wong, and Tanja Stadler. Sampling Trees from Evolutionary Models. Systematic Biology, 59(4):465–476, 2010.
- [11] Niels Holtgrefe, Leo van Iersel, Ruben Meuwese, Yuki Murakami, and Jannik Schestag. PANDA: Maximizing Phylogenetic Diversity in Network. Manuscript in preparation, 2025.
- [12] Daniel H. Huson, Regula Rupp, and Celine Scornavacca. Phylogenetic networks: Concepts, Algorithms and Applications. Cambridge University Press, 2010.
- [13] Laura Jetten and Leo van Iersel. Nonbinary Tree-Based Phylogenetic Networks. IEEE/ACM transactions on computational biology and bioinformatics, 15(1):205–217, 2016.
- [14] Mark Jones and Jannik Schestag. How Can We Maximize Phylogenetic Diversity? Parameterized Approaches for Networks. In Proceedings of the 18th International Symposium on Parameterized and Exact Computation (IPEC 2023), pages 30:1–30:12. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.IPEC.2023.30.
- [15] Mark Jones and Jannik Schestag. Maximizing Phylogenetic Diversity under Time Pressure: Planning with Extinctions Ahead. arXiv preprint arXiv:2403.14217, 2024. doi:10.48550/arXiv.2403.14217.
- [16] Mark Jones and Jannik Schestag. Parameterized Algorithms for Diversity of Networks with Ecological Dependencies. Manuscript in preparation, 2025.
- [17] Joshua A. Justison, Claudia Solis-Lemus, and Tracy A. Heath. SiPhyNetwork: An R package for simulating phylogenetic networks. Methods in Ecology and Evolution, 14(7):1687–1698, 2023.
- [18] Elizabeth Kolbert. The sixth extinction: an unnatural history. Henry Holt and Company, New York, 2014.
- [19] Christian Komusiewicz and Jannik Schestag. Maximizing Phylogenetic Diversity under Ecological Constraints: A Parameterized Complexity Study. In Proceedings of the 44th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2024). Schloss-Dagstuhl-Leibniz Zentrum für Informatik, 2024.
- [20] Binh T. Nguyen, Andreas Spillner, and Vincent Moulton. Computing Phylogenetic Diversity for Split Systems. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 5(02):235–244, 2008. doi:10.1109/TCBB.2007.70260.
- [21] Fabio Pardi and Nick Goldman. Species Choice for Comparative Genomics: Being Greedy Works. PLoS Genetics, 1(6):e71, 2005. doi:10.1371/journal.pgen.0010071.
- [22] Roseli Pellens and Philippe Grandcolas. Biodiversity Conservation and Phylogenetic Systematics: preserving our evolutionary heritage in an extinction crisis. Springer Nature, 2016.
- [23] Jannik Schestag. Weighted Food Webs Make Computing Phylogenetic Diversity So Much Harder. Submitted for publication, 2025.
- [24] Jannik Schestag. Who Should Have a Place on the Ark? Parameterized Algorithms for the Maximization of the Phylogenetic Diversity. Doctoral dissertation, Friedrich Schiller University Jena, 2025. URL: https://www.fmi.uni-jena.de/fmi_femedia/33737/dissertation-schestag-25-pdf.pdf.
- [25] Mike Steel. Phylogenetic Diversity and the Greedy Algorithm. Systematic Biology, 54(4):527–529, 2005.
- [26] Scott A. Taylor and Erica L. Larson. Insights from genomes into the evolutionary importance and prevalence of hybridization in nature. nature ecology & evolution, 3(2):170–177, 2019.
- [27] Leo van Iersel, Mark Jones, Jannik Schestag, Celine Scornavacca, and Mathias Weller. Phylogenetic Network Diversity Parameterized by Reticulation Number and Beyond. In Proceedings of the 23rd RECOMB International Workshop on Comparative Genomics (RECOMB-CG 2025), Seoul, Republic of Korea, 2025.
- [28] Leo van Iersel, Judith Keijsper, Steven Kelk, Leen Stougie, Ferry Hagen, and Teun Boekhout. Constructing Level-2 Phylogenetic Networks from Triplets. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(4):667–681, 2009. doi:10.1109/TCBB.2009.22.
- [29] Leo van Iersel, Sjors Kole, Vincent Moulton, and Leonie Nipius. An algorithm for reconstructing level-2 phylogenetic networks from trinets. Information Processing Letters, 178:106300, 2022. doi:10.1016/j.ipl.2022.106300.
- [30] Mathias Weller. phylo_tools (c++20 branch). https://github.com/igel-kun/phylo_tools/tree/C%2B%2B20, 2025. Accessed: 2025-05-10.
- [31] Kristina Wicke and Mareike Fischer. Phylogenetic diversity and biodiversity indices on phylogenetic networks. Mathematical Biosciences, 298:80–90, 2018.
- [32] Kristina Wicke, Arne Mooers, and Mike Steel. Formal Links between Feature Diversity and Phylogenetic Diversity. Systematic Biology, 70(3):480–490, 2021.
Appendix A Appendix
A.1 Proof of Lemma 4.6
Even though only parts of the proof of Lemma 4.6 have been deferred to the appendix, we present the entire proof here.
Lemma A.1.
Let be a subset of sides and let be derived from according to Construction 1. Then for , the signature of is if and only if respects . Moreover, if , then .
Proof.
Observe that by construction, each tree in corresponds to a side , and as such respects if and only if .
So now assume that . We prove that .
Note that every arc in has a corresponding arc in . For ease of notation, we speak of the same arc as existing in and . Let denote the weight of an arc in , and let denote the weight of in . For any arc that is not a lowest path-arc in , let if has an offspring from in , and let otherwise. Observe that .
Recall that denotes the probability that, taken some switching , the arc has an offspring from in . By Lemma 4.3 we know .
Claim A.2.
For an arc side with lowest path-arc , and a path-arc of that is not , if then and, otherwise, .
Proof.
First suppose that and . Then has an offspring which is a leaf of side . As there are no reticulations on the path between and in , it holds that and so . On the other hand,
and so the claim holds.
Next suppose that . Then has no offspring from side in . It follows that has an offspring from in if and only if has an offspring from in , for any switching . Thus . But then , and so the claim holds. A similar argument holds for the case that .
Claim A.3.
For a leaf-arc with on side , it holds that if , and , otherwise.
Proof.
Observe that if and only if , and otherwise. Since has signature , it follows that if . On the other hand if , then if and only if and so . The claim then follows from the fact that when is a leaf-arc.
Recall that, for any lowest path-arc on side , it holds that by definition. Let denote the arc sides on and let denote the reticulation sides. Note that . By construction it is .
To prove the main claim, let denote the leaf-arcs of , let denote the path-arcs which are not the lowest path-arc on their side, and let denote the arcs which are the lowest path-arc of their side. For each , let denote the arcs in belonging to a side in , and let . Note that the arc set of is . For each arc , let denote the lowest path-arc on the side containing (note that if then ).
A.2 Proof of Lemma 4.9
Lemma A.4.
Max-PD can be solved in time on forests.
Proof.
Let be an instance of Max-PD with forest where is the root of . Let be an integer bigger than .
We define a tree with root , as the union of with additional arcs of weight for each . Solve the instance of Max-PD with Faith’s greedy algorithm and return the result.
Let be a solution for . Because respects , there is a taxon beneath . Consequently, is on the path from to and so .
Conversely, let be a solution for . As , we conclude that contains a taxon of each tree and therefore respects . Therefore, is a solution for .
