Abstract 1 Introduction 2 Preliminaries 3 #P-Hardness of computing APDN when saving all taxa 4 Maximizing Average-Tree Phylogenetic Diversity 5 Experiments 6 Discussion References Appendix A Appendix

Average-Tree Phylogenetic Diversity of Networks

Leo van Iersel ORCID TU Delft, The Netherlands Mark Jones ORCID TU Delft, The Netherlands Jannik Schestag ORCID TU Delft, The Netherlands Celine Scornavacca ORCID ISEM, Université de Montpellier, CNRS, IRD, EPHE, Montpellier, France Mathias Weller ORCID LIGM, CNRS, Univsersité Gustave Eiffel, Marne-la-Vallée, France
Abstract

Phylogenetic diversity is a measure used to quantify the biodiversity of a set of species. Here, we introduce the “average-tree” phylogenetic diversity score in rooted binary phylogenetic networks and consider algorithms for computing and maximizing the score on a given network. Basically, the score is the weighted average of the phylogenetic diversity scores in all trees displayed by the network, with the weights determined by the inheritance probabilities on the reticulation edges used in the embeddings. We show that computing the score of a given set of taxa in a given network is #P-hard, directly implying #P-hardness of finding a subset of k taxa achieving maximum diversity score and, thereby, ruling out polynomial-time algorithms for these problems unless the polynomial hierarchy collapses. However, we show that both problems can be solved efficiently if the input network is close to being a tree in the sense that its reticulation number is small. More precisely, we prove that we can solve the optimization problem in networks with n leaves and r reticulations in 2𝒪(r)nk time. Using experiments on data produced by simulating a reticulate-evolution process, we show that our algorithm runs efficiently on networks with hundreds of taxa and tens of reticulations.

Keywords and phrases:
phylogenetic diversity, phylogenetic networks, network phylogenetic diversity, algorithms, computational complexity
Funding:
Leo van Iersel: Partially funded by the Dutch Organisation for Scientific Research (NWO) grant OCENW.KLEIN.125 and OCENW.M.21.306.
Mark Jones: Partially funded by the Dutch Organisation for Scientific Research (NWO) grant OCENW.KLEIN.125 and OCENW.M.21.306.
Jannik Schestag: Partially funded by the Dutch Research Council (NWO), project “Optimization for and with Machine Learning (OPTIMAL)”, OCENW.GROOT.2019.015.
Celine Scornavacca: Partially funded by French Agence Nationale de la Recherche through the CoCoAlSeq Project (ANR-19-CE45-0012). This is the contribution ISEM 2025x-XXX of the Institut des Sciences de l’Evolution de Montpellier.
Copyright and License:
[Uncaptioned image] © Leo van Iersel, Mark Jones, Jannik Schestag, Celine Scornavacca, and Mathias Weller; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Applied computing Computational biology
; Theory of computation Fixed parameter tractability ; Theory of computation Problems, reductions and completeness
Editors:
Broňa Brejová and Rob Patro

1 Introduction

Human-driven habitat degradation and environmental change have led to accelerated rates of species extinction, posing a significant threat to global biodiversity [18]. In response, conservation efforts face both logistical constraints – such as limited financial resources – and conceptual challenges, notably in identifying effective prioritization strategies. This has motivated the development of formal criteria and quantitative indicators to guide decision-making and assess biodiversity in a systematic and reproducible manner. Here, we study formal measures for quantifying the biodiversity of a given set of species and algorithms for computing and maximizing these measures.

Phylogenetic Diversity (PD) is a well-studied measure used to quantify the biodiversity of a set of species using their evolutionary relationships [22]. When these relationships are described by a rooted phylogenetic tree (with edge lengths) on a set X of species, the phylogenetic diversity of a subset of the species AX is simply the total length of all edges that lie on at least one path from the root of the tree to a leaf in A [8]. The main intuition behind this is that species that are evolutionarily further apart together contribute more diversity than species that are evolutionarily closely related. Phylogenetic diversity can more formally be interpreted as follows in terms of features (e.g., genes or morphological features). Suppose that (1) the number of features introduced on each edge of the tree is proportional to the length of that edge, (2) features are never lost, and (3) no feature is introduced independently on multiple branches. Then, the number of features in A is proportional to its phylogenetic diversity. While the assumptions on the model may seem somewhat unrealistic, it has been used prominently in the literature [31, 3, 5, 24].

In many cases, however, describing evolutionary relationships by trees is too simplistic [7]. Rooted phylogenetic networks generalize rooted phylogenetic trees and are more suited to represent complex evolutionary scenarios involving, for example, hybridization events [12, 1, 26]. Roughly speaking, a rooted phylogenetic network is a directed acyclic graph with a single root and labeled leaves. Vertices with two (or more, in the nonbinary case) incoming arcs are called reticulations and can be used to model hybridization events, lateral gene transfer, or other types of horizontal evolutionary events. Generalizing the notion of phylogenetic diversity to networks is not obvious, and multiple variants have recently been proposed for rooted [32, 3, 5] and unrooted [20, 2, 4] networks.

The currently best-studied variants are AllPathsPD and NetworkPD [3, 14, 27, 5, 24]. Both definitions generalize PD on trees, in the following ways. AllPathsPD(A) is the total length of all edges that lie on at least one path from the root of the network to a leaf in A. NetworkPD generalizes AllPathsPD by allowing arcs to have inheritance probabilities, representing the probability that a given feature in the parent is inherited by the child. The idea of NetworkPD(A) is, roughly speaking, to compute the expected number of distinct features in A by using the inheritance probabilities to compute, for each arc uv, the probability γuv that at least one copy of a feature introduced on the arc uv is inherited by a leaf in A. More precisely, γuv is defined as γvw times the inheritence probability of uv if v is a reticulation with child w, as 1i(1γvwi) if v is a tree node with children wi, and as 1 or 0 if v is a leaf, depending on whether or not vA.

Note that, in the NetworkPD setting, features are inherited independently from different parents. For example, in the network in Figure 1, a feature introduced on the arc ρu is present in all parents of the reticulation above species B but it has only a 0.75 probability of being inherited by B. In addition, in NetworkPD it is possible to inherit the same feature from both parents, resulting in the presence of multiple copies of that feature. While this can be realistic in certain scenarios (e.g. allopolyploidy), in other cases it is more realistic to assume that each feature is inherited from at most one parent.

A less studied alternative averages the phylogenetic diversity score over all trees displayed by the network [31]. Each tree has an assigned probability equal to the product of the inheritance probabilities of the reticulation edges used by its embedding, and the PD-score of each tree is weighted by this probability. Thus, the resulting measure can be interpreted as the expected diversity of any tree displayed by the network. In this model, features inherited from different edges incoming to the same reticulation vertex are not considered to be independent. The assumption is basically that each feature is inherited from exactly one parent (or introduced on that edge). Note however that Wicke and Fischer [31] use an uncommon definition of displayed trees, in which trees are only considered to be displayed if they contain all vertices of the network and all their leaves are in X. In particular, a network that is not tree-based [13] would have no displayed trees and an undefined phylogenetic diversity score. Since we see no reason to ignore certain displayed trees, we will take the (weighted) average over all displayed trees in this paper. We call the resulting phylogenetic diversity measure on phylogenetic networks the average-tree phylogenetic diversity, APD for short, see Figure 1 for an example.

Figure 1: A rooted phylogenetic network 𝒩 on X={A,B,C,D} along with the two rooted phylogenetic trees T1,T2 that 𝒩 displays. Edge weights are indicated above the edges. Reticulation edges are in blue with their inheritance probability in blue below. The APD score of {B,D} is 22 in T1 and in T2 and, hence, also in 𝒩. The NetworkPD score of {B,D} is 20 since γvA=γwC=0, γrB=γρD=1, γvr=γuv=γwr=γuw=0.5, and γρu=0.75, implying a score of 2+0.5(2+2+1+3)+0.758+8=20. However, the NetworkPD score of {C,D} is 3+3+8+8=22. It turns out that the only size-2 set maxizing the NetworkPD score is {C,D}, while {B,D} is the only size-2 subset maximizing the APD score.

While APD is a natural and biologically relevant generalization of phylogenetic diversity from trees to networks, we show that it comes with great challenges. Although for most PD measures on networks, such as AllPathsPD and NetworkPD, the diversity score of a given set AX of species can be easily computed in polynomial time, we will show in this paper that for APD this is already computationally hard (more precisely, #P-hard; a polynomial-time algorithm for it would imply a collapse of the polynomial hierarchy.) Consequently, the maximization problem associated to APD is also computationally hard. This problem, called Max-APD, aims to find a set AX of k species with maximum APD score.

On the positive side, we show that, although both these problems are #P-hard, they can both be solved exactly and rather efficiently in practice. We do this by presenting a parameterized algorithm finding a size-k set of leaves that maximizes the APD score in binary networks with r reticulations and n leaves in 𝒪(26rnk) time. This algorithm can easily be adapted to compute the score of a given set A. The running time of our algorithm scales linearly with the number of species but exponentially with the number of reticulations in the network. Using practical experiments on simulated data we show that the algorithm can solve instances with up to 500 species and 55 reticulations within 5 minutes on 16 GB RAM and 8 CPUs computer, with k5. In addition, the algorithm efficiently solves instances with k up to at least 40 if the number of reticulations is fixed as 12.

1.1 Related Literature

There is a rich body of literature on phylogenetic diversity on (rooted and unrooted) phylogenetic trees. Most relevant to our work, the optimization problem of finding k species with maximum phylogenetic diversity can be solved in polynomial time on (rooted and unrooted) trees [21, 25]. Further research found fast algorithms on trees, where biological dependencies were considered also [19, 23, 16]. However, only a few papers have studied phylogenetic diversity on phylogenetic networks [32, 3, 15, 27].

Wicke and Fischer [31] introduced several generalizations of phylogenetic diversity of rooted trees to rooted networks, including AllPathsPD and a variant of APD (as discussed above). The question of efficient computability is not addressed, nor are the corresponding maximization problems. Instead, the authors define several biodiversity indices on phylogenetic networks, which can be used to rank species for conservation. The associated maximization problems are first addressed by Bordewich et al. [3], showing that the decision versions of the maximization problems of AllPathsPD and NetworkPD are NP-hard, even for tree-child networks but, on level-1 networks,111The level of a network is the maximum reticulation number of any biconnected component. It may be arbitrarily smaller than the reticulation number of the network. a solution maximizing the AllPathsPD score can be found in polynomial time. They also introduce two more variants, MaxWeightTree-PD and MinWeightTree-PD. In these, the PD𝒩(A) of taxa is the maximum and minimum weight of the phylogenetic diversity of A in any display tree. While for the former, the maximization problem is polynomial-time solvable, for the latter, computing the score of a given subset of the species is already NP-hard.

The AllPathsPD problem was shown to also be susceptible to greedy approaches, as long as the input network is semi-binary (maximum in-degree two) and has level111The level of a network is the maximum reticulation number of any biconnected component. It may be arbitrarily smaller than the reticulation number of the network. at most two [5].

Jones and Schestag [14] studied the maximization problem for AllPathsPD in more detail. They showed that it is W[2]-hard when the parameter is the number of species to save, but fixed-parameter tractable (FPT) when the parameter is the optimal phylogenetic diversity, the acceptable loss of phylogenetic diversity, the number of reticulations in the network, or the treewidth of the underlying graph. Recently, this measure has been considered in semidirected networks [11].

In a recent study [27], the maximization problem for NetworkPD has been shown to be FPT with respect to the number of reticulations of the network. Unfortunately, this result cannot be strengthened by using the level as a parameter, since NetworkPD remains NP-hard on level-1 networks [27].

2 Preliminaries

Phylogenetic Networks.

In this paper, we consider binary, rooted phylogenetic networks with integer weights on arcs, in which each reticulation has an associated probability distribution on its incoming arcs. Formally, for a set X of taxa, a binary, rooted phylogenetic network on X, later only called phylogenetic network, is a tuple 𝒩=(X,V,A,,ω,λ) satisfying the following conditions:

  • (V,A) forms a directed acyclic graph with vertices V and arcs A, in which L(𝒩) is the set of leaves (vertices with in-degree 1 and out-degree 0), and for which

    • There is a single vertex of in-degree 0 and out-degree 2, called the root ρ(𝒩);

    • Vertices of in-degree 2 and out-degree 1 are called the reticulations and denoted R(𝒩);

    • The remaining vertices have in-degree 1 and out-degree 2 and are called the tree nodes.

    An arc uvA is a reticulation arc if vR(𝒩) and AR(𝒩) is the set of reticulation arcs.

  • :L(𝒩)X is a bijective mapping of leaves of 𝒩 to taxa in X and is called labelling of 𝒩. We refer to a leaf of 𝒩 interchangeably with the taxa it is labeled with.

  • ω:A0 assigns each arc uv a non-negative integer, called the (arc-)weight of uv.

  • λ:AR(𝒩)[0,1] is a function that assigns each reticulation arc a real value between 0 and 1, subject to the constraint that λ(u1v)+λ(u2v)=1 for any reticulation v with incoming arcs u1v and u2v. We call λ(uv) the inheritance probability of uv. Informally, λ(uv) represents the probability that a randomly-selected gene in v is inherited from u.222We explicitly point out the difference to NetworkPD, where λ(uv) represents the proportion of genes in u that are inherited by v [3, 27]. Here, λ(uv) is the probability that a randomly chosen switching σ has σ(v)=e. Consequently, λ(e) gives the probability that anything is inherited along e.

We often refer to the graph (V,A) as 𝒩 and we refer to vV as a vertex of 𝒩 and uvA as an arc in 𝒩. For such an arc uv, we call u a parent of v, and v a child of u. For a vertex v of a phylogenetic network, the offspring of v is the set of leaves reachable from v with a direct path. We will permit ourselves to write V(𝒩) for V and A(𝒩) for A. A phylogenetic tree is a phylogenetic network with no reticulations.

Phylogenetic Diversity.

For a phylogenetic tree 𝒯=(V,A) on X, the phylogenetic diversity PD𝒯(Z) of a set ZX is the total weight of arcs eA that are on a path from the root to a leaf in Z [8]. In this section, we generalize this measure to phylogenetic networks. For an example consider Figure 2.

Figure 2: Left: A network with edge-weights (above each edge) and reticulation edges in blue with their inheritance proportions in blue below. Reticulation vertices are filled black. Right: The four possible switching-trees with edges marked in orange if they are considered when Z={A,B} is saved, yielding APD𝒩(Z)=.4120+.1180+.1190+.4190=161.

A switching σ:R(𝒩)A(𝒩) is a function that maps each reticulation to one of its incoming arcs. That is, a switching corresponds to a decision about which incoming arc each reticulation will inherit from. The set of switchings in 𝒩 is denoted by 𝒮(𝒩). The probability P(σ) of a switching σ is rR(𝒩)λ(σ(r)). That is, we assume that for each reticulation, one of its incoming arcs is chosen (with probability given by λ) independently at random.

The switching-tree Tσ for a switching σ is the subgraph of 𝒩 derived by deleting all reticulation arcs except for those in {σ(r)rR(𝒩)}. Observe that Tσ is a directed tree, possibly with some “dead” (unlabelled) leaves and vertices of degree 2. Note that X is a subset of the leaves of Tσ, for each switching σ. Given a switching σ and a set of taxa ZX, we define PDσ(Z) as PDTσ(Z) under the usual definition of phylogenetic diversity on trees.

Now, we can define the measure we study in this paper. For a phylogenetic network 𝒩 and a set ZX of taxa, the average phylogenetic diversity of Z with respect to 𝒩 is

APD𝒩(Z):=σ𝒮(𝒩)P(σ)PDσ(Z)

Informally, APD𝒩(Z) is the amount of phylogenetic diversity that we expect to preserve by saving Z, under the assumption that each reticulation independently chooses one of its parents to inherit from. Note that this is equivalent to the definition given in the introduction based on displayed trees since suppressing degree-2 vertices and deleting unlabelled leaves does not affect the phylogenetic diversity score of a tree. The associated decision problem is as follows.

Max-APD Input: A phylogenetic network 𝒩 and integers k and D. Question: Is there a set ZX of taxa which has a size of k and with APD𝒩(Z)D?

3 #P-Hardness of computing APDN when saving all taxa

In the following, we prove that computing APD𝒩 is #P-hard, even if all taxa are being saved, in contrast to other phylogenetic diversity measures on networks like NetworkPD [3, 27] or AllPathsPD [3, 14, 16], that can be computed (not optimized) in polynomial time.

Theorem 3.1.

Computing APD𝒩(X) for a phylogenetic network 𝒩 on X is #P-hard.

Figure 3: Left: part of graph G. Right: the corresponding part of 𝒩. Observe that ue has a directed path to xv in 𝒩 if and only if e is incident with v in G.

We prove the theorem by reduction from Counting Perfect Matchings. A perfect matching of an undirected graph G=(V,E) is a set of |V|/2 edges EE such that each vertex of G is incident with exactly one edge of E. Counting the number of perfect matchings in a cubic graph is #P-hard [6, Theorem 6.2].

We first observe that the number of perfect matchings in a graph G=(V,E) is the same as the number of mappings ϕ:VE that are perfect assignments, as defined next. A mapping ϕ:VE is an assignment if ϕ(v) is an incident edge of v for all vV. An assignment ϕ is perfect if |ϕ(V)|=|V|/2. As |ϕ1(e)|2 for each edge e, an assignment ϕ is perfect if and only if |ϕ1(e)|=2 for each eϕ(V). That is, for each eE, a perfect assignment ϕ assigns either both or none of the incident vertices to e. Thus, mapping each perfect assignment ϕ to ϕ(V) constitutes an injection from perfect assignments to perfect matchings in G. On the other hand, if EE is a perfect matching in G, then a perfect assignment ϕ:VE can be defined from E by setting ϕ(v) to be the unique edge in E incident to v for all vV. It is easy to see that this mapping is an injection from the set of all perfect matchings in G to the set of perfect assignments. Thus, the perfect matchings in G are in bijection with its perfect assignments, so their number is the same. In the following, we let MG denote this number.

Given an arbitrary cubic graph G=(V,E), we construct a network 𝒩 on a set of taxa X in such a way that MG can be determined from APD𝒩(X). See Figure 3 for an example.

Let X:={xvvV}, that is, X has one element for each vertex in G. Now, for each vV, add two vertices rv1 and rv2, where rv2 is a parent of rv1, and rv1 is a parent of xv. For each eE, add a vertex ue to 𝒩. For any vertex vV, choose one of its three incident edges e arbitrarily (recall that G is cubic), and add an arc uerv1. Then, for the remaining two edges e and e′′ incident with v, add the arcs uerv2 and ue′′rv2. Observe that, so far, ue has out-degree 2 and in-degree 0 for each eE, and rv1 and rv2 are reticulations for all vV. Furthermore, ue has a directed path to xv in 𝒩 if and only if e is incident to v in G.

Create two new special vertices u1 and u2 and add an arc u1u2 (this will be the only arc that gets a large weight). Add a root ρ as the parent of u1. Now, for each eE, add a new vertex re as the parent of ue. For each vertex re, add two parents pe and qe. Add arcs such that {u1}{qeeE} and {u2}{peeE} form paths. We now have that each vertex re is a reticulation, and exactly one of the incoming arcs of re is on a directed path from the root that passes through u1u2. In Figure 4 an illustration is given.

Figure 4: Example of the top part of the network 𝒩, given a graph G with 5 edges. The heavy arc u1u2 is in bold and blue. For each of the reticulations re1 to re5, the inheritance probability of the left incoming arc is (1δ), and the inheritance probability of the right incoming arc is δ.

We define the arc weights by setting ω(u1u2):=H for H a large value to be defined later, and setting ω(a):=1 for all other arcs aA(𝒩).

It remains to define the inheritance probabilities on the reticulation arcs.

First, consider re for some eE. Then, set λ(pere):=1δ and λ(qere):=δ, where δ(0,1) is a small constant to be defined later. Note that, when we choose 𝒯σ with probability P(σ) then, for each eE, the probability that u2 has a path to some re is exactly 1δ.

Now, consider rv1 and rv2 for some vV. Then, set λ(rv2rv1):=2/3, λ(a1)=1/3, and set λ(a2):=λ(a3)=1/2, where a1 is the other arc incoming at rv1 and a2 and a3 are the two arcs incoming at rv2. The idea here is that when we choose a switching σ and corresponding tree 𝒯σ with probability P(σ), each vertex ue for eE adjacent to v has probability 1/3 of having a path to xv in 𝒯σ. As exactly one such xe will have a path to xv, we can think of the choice of σ as corresponding to a choice of assignment (where v is mapped to e in the assignment if ue has a path to xv in 𝒯σ). Our choice of inheritance probabilities ensures that each of the three edges incident to v is chosen with probability 1/3, independently for each v, and so each possible assignment is chosen with equal probability. This completes the reduction.

Given a switching σ, let χσ:=1 if u2 has a path to some leaf xv in 𝒯σ, and χσ:=0 otherwise. Thus, the expected contribution of u1u2 to APD𝒩(X) is Hσ𝒮(R)P(σ)χσ. We first show that this value depends entirely on the possible assignments of G.

Lemma 3.2.

σ𝒮(R)P(σ)χσ=(1/3)|V|ϕ(1δ|ϕ(V)|).

Proof.

We first fix some notation. A partial switching on RR is the restriction of a switching to the subset R of reticulations. Let 𝒮(R) denote the set of all partial switchings on R. For a partial switching σ𝒮(R), let P(σ):=rRλ(σ(r)).

Let RV be the set of reticulations {rvivV,i{1,2}}, and let RE be the set of reticulations {ueeE}. If σ is a switching and σV and σE are the restrictions of σ to RV and RE respectively, then we write σ=σVσE. Observe that P(σ)=P(σV)P(σE) in this case and that R=RVRE and therefore 𝒮(R)={σVσEσV𝒮(RV),σE𝒮(RE)}.

For a partial switching σV on RV, we can associate an assignment ϕ on G, as follows. For any vV, let ee, and e′′ be the incident edges of v in G, and without loss of generality assume that ue is a parent of rv1 and the parents of rv2 are ue and ue′′. We say a partial switching σ𝒮(RV) and assignment ϕ agree on v if one of the following holds: 1. ϕ(v)=e and σ(rv1)=uerv1; 2. ϕ(v)=e and σ(rv1)=rv2rv1, σ(rv2)=uerv2; or 3. ϕ(v)=e′′ and σ(rv1)=rv2rv1, σ(rv2)=ue′′rv2. That is, ϕ(v) is the unique edge incident to v for which the switching σV chooses a set of reticulation arcs that connect ue to xv. See Figure 5. An assignment ϕ and a partial switching σ are associated if ϕ and σ agree on all vV.

Figure 5: Left: part of a graph and an assignment ϕ on v1 and v2. Right: partial switching σV on this segment. σV and ϕ agree on v1, but not on v2.

For assignments ϕ on G, let 𝒮ϕ denote the set of all partial switchings σ𝒮(RV) associated with ϕ. We use these sets 𝒮ϕ to decompose the sum σ𝒮(R)P(σ)χσ, as follows:

σ𝒮(R)P(σ)χσ =σ𝒮(RV)σ′′𝒮(RE)P(σσ′′)χσσ′′
=σ𝒮(RV)σ′′𝒮(RE)P(σ)P(σ′′)χσσ′′
=ϕσ𝒮ϕP(σ)σ′′𝒮(RE)P(σ′′)χσσ′′

The next two claims allow us to express the above in terms of |ϕ(V)| for all assignments ϕ.

Claim 3.3.

σ′′𝒮(RE)P(σ′′)χσσ′′=1δ|ϕ(V)|, for each assignment ϕ on G and σ𝒮ϕ.

Proof.

Let σ=σσ′′ for some σ′′𝒮(RE) and recall that χσ=1 if and only if u2 has a path to some leaf xv in 𝒯σ. Any such path must pass through some ue, and ue only has a path in Tσ to some xv if e is assigned to v by ϕ i.e. ϕ(v)=e. Thus, χσ=1 if and only if u2 has a path in 𝒯σ to ue for some eϕ(V). This occurs if and only if σ′′(re)=pere for some eϕ(V). Or, equivalently, χσ=0 if and only if σ′′(re)=qere for all eϕ(V). It follows

σ′′𝒮(RE)P(σ′′)χσσ′′ =1σ′′𝒮(RE)eϕ(V):σ′′(re)=qereP(σ′′)
as, roughly speaking, the sum of the probabilities of partial switchings that choose the qere-arcs equals the probability that a random switching makes all these choices, this equals
=1eϕ(V)λ(qere)=1eϕ(V)δ=1δ|ϕ(V)|.

Claim 3.4.

For any assignment ϕ on G, it holds that σ𝒮ϕP(σ)=(1/3)|V|.

Proof.

Recall that, for any σ𝒮(RV) and any assignment ϕ, we have σ𝒮ϕ if and only if σ agrees with ϕ on all vV. Recall also that for any vV, whether σ and ϕ agree on v depends only on ϕ(v), σ(rv1) and σ(rv2). So suppose we choose a partial switching σ𝒮(Rv) with probability P(σ). Note that the probability that σ𝒮ϕ is σ𝒮ϕP(σ).

Now consider the probability that σ agrees with ϕ on some vV. If ϕ(v)=e for the edge e such that uerv1A(𝒩), then σ agrees with ϕ on v if and only if σ(rv1)=uerv1, and this occurs with probability λ(uerv1)=1/3. Now, if ϕ(v)=e where ue is one of the parents of rv2, then σ agrees with ϕ on v if and only if σ(rv1)=rv2rv1 and σ(rv2)=uerv2, and this occurs with probability λ(rv2rv1)λ(uerv2)=2/31/2=1/3. In either case, σ and ϕ agree on v with probability 1/3. As these agreements are independent for each vV, the probability that σ agrees with ϕ on all vV is (1/3)|V|. But this is exactly the probability that σ𝒮ϕ, which is σ𝒮ϕP(σ). By ˜3.3 and ˜3.4, we have

ϕσ𝒮ϕP(σ)σ′′𝒮(RE)P(σ′′)χσσ′′ =ϕσ𝒮ϕP(σ)(1δ|ϕ(V)|)
=ϕ(1/3)|V|(1δ|ϕ(V)|)
=(1/3)|V|ϕ(1δ|ϕ(V)|)

Lemma 3.2 gives us an expression for the expected contribution Hσ𝒮(R)P(σ)χσ, in terms of the values |ϕ(V)| for all assignments ϕ on G. We next show that this value can be restricted to a certain range depending on on the number MG of perfect assignments. For this next step, we now fix the value of δ such that δ(1/3)|V|1/2. Let α:=(1/3)|V|δ|V|/2.

Lemma 3.5.

1αMGσ𝒮(R)P(σ)χσ1α(MG+1/2).

Proof.

From Lemma 3.2 we have that σ𝒮(R)P(σ)χσ=(1/3)|V|ϕ(1δ|ϕ(V)|). As there are exactly 3|V| assignments ϕ on G, this is equivalent to

σ𝒮(R)P(σ)χσ=(1/3)|V|3|V|(1/3)|V|ϕδ|ϕ(V)|=1(1/3)|V|ϕδ|ϕ(V)|

Recall that |ϕ(V)||V|/2 for each assignment ϕ, with equality if and only if ϕ is a perfect assignment. Then, counting only the perfect assignments,

1(1/3)|V|ϕ perfectδ|ϕ(V)|=1(1/3)|V|MGδ|V|/2=1αMG

proving the first inequality.

For the second inequality, observe that |ϕ(V)||V|/2+1 for any imperfect assignment ϕ. (As G is cubic, |V| is even). Then, counting perfect and imperfect assignments separately, and using the fact that there are at most 3|V| imperfect assignments, we have

1(1/3)|V|ϕδ|ϕ(V)| 1(1/3)|V|MGδ|V|/2(1/3)|V|3|V|δ|V|/2+1
1αMGδ|V|/2δ
1αMGδ|V|/2(1/3)|V|1/2
=1α(MG+1/2)

By what we have seen so far, the expected contribution of arc u1u2 to APD𝒩(X) is between H(1α(MG+1/2)) and H(1αMG), where MG is the number of perfect assignments (and therefore the number of perfect matchings) on G. It remains to show that this term dominates the other terms in APD𝒩(X), such that APD𝒩(X) is also bounded by a range dependent on MG. For this, we fix H such that H>2|A(𝒩)|/α. That is, the weight of u1u2 is more than 2/α times the total weight of all the other arcs in 𝒩.

Lemma 3.6.

We have

H(1α(MG+1/2))APD𝒩(X)<H(1α(MG1/2))

or equivalently, MG1/2<(HAPD𝒩(X))/(Hα)MG+1/2.

Proof.

First, consider the tree 𝒯σ associated with an arbitrary switching σ. If u2 has a path to a leaf xv in 𝒯σ (i.e. if χσ=1) then u1u2 has an offspring in X, and so PDσ(X)ω(u1u2)=H. Thus PDσ(X)Hχσ. As the weight of all other arcs is 1, we also have that PDσ(X)<Hχσ+|A(𝒩)|Hχσ+Hα/2. Then, by Lemma 3.5,

APD𝒩(X) =σ𝒮(R)P(σ)PDσ(X)Hσ𝒮(R)P(σ)χσH(1α(MG+1/2))
and
APD𝒩(X) =σ𝒮(R)P(σ)PDσ(Z)
<Hσ𝒮(R)(P(σ)χσ(Z))+Hσ𝒮(R)(P(σ)α/2)
H(1αMG)+Hα/2=H(1α(MG1/2))

Given Lemma 3.6, we can count the number of perfect matchings in a cubic graph G as follows. Construct the network 𝒩 on X as described above in polynomial time, then compute M:=(HAPD𝒩(X))/(Hα). Determine the unique integer M for which we have M1/2<MM+1/2. By Lemma 3.6, M is the number of perfect matchings in G.

As counting perfect matchings is #P-hard, it follows that computing APD𝒩(X) is also #P-hard. This completes the proof of Theorem 3.1.

4 Maximizing Average-Tree Phylogenetic Diversity

In this section, we show that Max-APD is fixed-parameter tractable (FPT) with respect to the number of reticulations. For the sake of brevity, we restrict our theoretical proof to simple networks – that is, 𝒩 has no cut-arcs except for those incident to degree-1 vertices. However, we see no reason why this result could not be extended to non-simple networks. Indeed, our implementation, see Section 5, can deal with arbitrary networks.

We make use of the notion of generators, as introduced in [28]. This gives us a certain useful partition of the leaves of a network into sides. At the high level, our algorithm “guesses” the optimal solution Z by guessing which sides of the network 𝒩 contain a leaf from Z. As the number of sides is 𝒪(|R(𝒩)|, there are at most 2𝒪(|R(𝒩)|) such guesses to consider. Once it has been decided which sides contain a leaf from the solution, the problem reduces to a problem on forests, which can be solved in polynomial time.

In what follows, let 𝒩 be a network and let r:=|R(𝒩)|, n:=|V(𝒩)| and m:=|A(𝒩)|. That is, r, n, and m denote the number of reticulations, vertices, and arcs, respectively.

4.1 Characterizing solutions in terms of generator sides

We recall the notion of generators, introduced in [28]. Our definitions are taken from [29].

Definition 4.1.

Let 𝒩 be a simple network. The generator of 𝒩 is the directed multigraph G obtained from 𝒩 by (1) deleting nodes with in-degree one and out-degree zero (leaves) and (2) suppressing all nodes with in- and out-degree one. The arcs and in-degree-2 out-degree-0 vertices of G are called sides. The arcs are also called arc sides, and the in-degree-2 out-degree-0 vertices are also reticulation sides. We say that a leaf  is on side S (or that side S contains ) if either

  • S is a reticulation side of G and the parent of in 𝒩, or

  • S is an arc side of G, obtained by suppressing in-degree-1 out-degree-1 vertices of a path P in 𝒩 and the parent of in N lies on P.

In addition, if S is an arc side of G, obtained by suppressing in-degree-1 out-degree-1 vertices of a path P in 𝒩, then we say that every arc in P is a path-arc of side S. We call the lowest arc in P the lowest path-arc of S. Observe that every arc in 𝒩 is either a leaf-arc, or a path-arc of some side.

For an arc side S, let TS denote the subgraph of 𝒩 whose vertices are the leaves on side S together with the vertices incident to any path-arcs of S, and whose arcs are the leaf-arcs incident to the leaves of S together with the path-arcs of S. Note that TS is a tree (though not necessarily a phylogenetic tree). We call TS the side-tree of S.

The generator of a simple binary network with r reticulations has at most r reticulation sides and at most 4r2 arc sides [9].

In order to find solutions for Max-APD on simple networks, we characterize possible solutions in terms of their relation to the generator of 𝒩.

Definition 4.2.

For a simple phylogenetic network 𝒩 on X, let 𝒮(𝒩) denote the sides of the generator of 𝒩. For ZX, the signature SigZ of ZX is the set of all sides that contain at least one leaf of Z.

Now, to find a solution for Max-APD, it is enough to compute, for each signature Sig, the maximum value of APD𝒩(Z) over all sets ZX with signature Sig and |Z|k.

4.2 Contribution of lowest path-arcs

For an arc e in 𝒩, a set of taxa ZX and a switching σ:R(𝒩)A(𝒩), define χ(e,Z,σ) to be 1 if e has an offspring from Z in Tσ, and 0 otherwise (if e is not an arc in Tσ then define χ(e,Z,σ) to be 0). Then define ψ𝒩(e,Z):=switching σP(σ)χ(e,Z,σ)

Intuitively, ψ(e,Z) corresponds to the expected probability that e has an offspring in Z under a randomly selected switching. In the next lemma, we show that ω(e)ψ𝒩(e,Z) gives the contribution of arc e to APD𝒩(z).

Lemma 4.3.

APD𝒩(Z)=eA(𝒩)ω(e)ψ𝒩(e,Z).

Proof.

By definition, PDσ(Z)=eA(Tσ)ω(e)χ(e,Z,σ)=eA(𝒩)ω(e)χ(e,Z,σ). Thus,

APD𝒩(Z)= switching σP(σ)PDσ(Z)
= switching σP(σ)eA(𝒩)ω(e)χ(e,Z,σ)
= eA(𝒩)ω(e)switching σP(σ)χ(e,Z,σ)=eA(𝒩)ω(e)ψ𝒩(e,Z).

Thus, to compute APD𝒩(Z), it is enough to compute ψ𝒩(e,Z) for each arc e. The next lemma shows that for certain arcs, the value of ψ𝒩(e,Z) depends only on the signature of Z. This will be important for the construction that follows.

Lemma 4.4.

Let e be the lowest path-arc of a path side S and let Z,ZX be two subsets of X with the same signature. Then ψ𝒩(e,Z)=ψ𝒩(e,Z).

Proof.

By the definition of ψ𝒩(e,Z), it suffices to show that χ(e,Z,σ)=χ(e,Z,σ) for each switching σ. So suppose that e has an offspring zZ in Tσ such that χ(e,Z,σ)=1. Let S be the side containing z. As Z and Z have the same signature, S also contains some leaf zZ. Since z and z are on the same side and z is an offspring of the arc e in Tσ, also z is an offspring of e in Tσ. Thus z is an offspring of e in Tσ, and so χ(e,Z,σ)=1. By a symmetric argument, from χ(e,Z,σ)=1 we conclude χ(e,Z,σ)=1. Thus χ(e,Z,σ)=χ(e,Z,σ) for all σ and we can conclude that ψ𝒩(e,Z)=ψ𝒩(e,Z).

For a lowest path-arc e and signature 𝒮𝒮(𝒩), define ϕ(e,𝒮) to be ψ𝒩(e,Z) for any ZX with with signature SigZ=𝒮. By Lemma 4.4, this is well-defined.

Lemma 4.5.

For any lowest path-arc e and any ZX, the value of ψ𝒩(e,Z) can be computed in 𝒪(2rr) time. Consequently, ψ𝒩(e,𝒮) can be computed in 𝒪(2rr) time for each lowest path-arc e and signature 𝒮.

Proof.

Observe that as each reticulation has two incoming arcs and a switching maps each reticulation to one of its incoming arcs, there are exactly 2r switchings. For each switching σ, the value of P(σ)χ(e,Z,σ) for all e that are lowest arcs of a generator arc side, can be computed in 𝒪(r) time by a bottom-up traversal in the generator of 𝒩, registering whether each such e has an offspring from Z in Tσ. Adding up all values of P(σ)χ(e,Z,σ) (to compute ψ𝒩(e,Z)) can therefore be done in 𝒪(2rr) time.

To compute ψ𝒩(e,𝒮), it is sufficient to construct a set ZX with signature 𝒮 by picking an arbitrary leaf from each side S𝒮. As ψ𝒩(e,Z)=ψ𝒩(e,𝒮), we can compute this value in 𝒪(2rr) time.

4.3 Reduction to Forests

Let (𝒩,k,D) be an instance of Max-APD. Fix a signature 𝒮𝒮(𝒩) and assume there exists a solution Z with SigZ=𝒮 We now show how to reduce Max-APD under this assumption to a related problem on forests.

Let X1,,Xs be non-empty and pairwise disjoint subsets of X, and let F be a forest of Xi-trees Ti. Define X to be the union X1Xs. For technical reasons, we allow Ti to have non-integer arc weights and degree-2 vertices. A set ZX respects F if Z contains a taxon of each Xi for i{1,,s}. For a set ZX, define PDF(Z):=i=1sPDTi(ZXi). For a forest F and two integers k,D, we define a decision problem Max-PD in which we are asked whether a set ZX exists such that Z has a size of k, PDF(Z)D, and respects F. This definition generalizes the classic definition of Max-PD on trees [8].

Construction 1.

Initialize D=0, k=k, and let F be an empty set. Then for each side S in 𝒮(𝒩) do the following:

  • If S is a reticulation side, then let x be the single leaf on this side and let rx be its incoming arc. If S𝒮, then increase D by ω(rx), and add TS to F, where TS is a tree with a single leaf x and single arc rx of weight ω(rx).

  • If S is an arc side, then let eS=vr denote the lowest path-arc of S. For each path-arc e of S (including eS), add ω(e)ψ𝒩(e,𝒮) to D. Then if S𝒮, add TS to F, where TS is derived from the side-tree TS as follows: Multiply the weight of each path-arc by (1ψ𝒩(eS,𝒮)), then delete the lowest path-arc eS.

Lemma 4.6.

Let 𝒮𝒮(𝒩) be a subset of sides and let (D,F) be derived from (𝒩,𝒮) according to Construction 1. Then for ZX, the signature of Z is SigZ=𝒮 if and only if Z respects F. Moreover, if SigZ=𝒮, then APD𝒩(Z)=PDF(Z)+D.

Proof.

Observe that by construction, each tree in F corresponds to a side S𝒮, and as such Z respects F if and only if SigZ=𝒮.

So now assume that SigZ=𝒮. We prove that APD𝒩(Z)=PDF(Z)+D.

Note that every arc in F has a corresponding arc in 𝒩. For ease of notation, we speak of the same arc e as existing in F and 𝒩. Let ω𝒩(e) denote the weight of an arc e in 𝒩, and let ωF denote the weight of e in F. For any arc e that is not a lowest path-arc in 𝒩, let χF(e,Z):=1 if e has an offspring from Z in F, and let χF(e,Z):=0 otherwise. Observe that PDF(Z)=eA(F)ωF(e)χF(e,Z).

Recall that ψ𝒩(e,Z) denotes the probability that, taken some switching σ, the arc e has an offspring from Z in Tσ. By Lemma 4.3 we know APD𝒩(Z)=eA(𝒩)ω𝒩(e)ψ𝒩(e,Z).

Claim 4.7.

For an arc side S with lowest path-arc eS, and e a path-arc of S that is not eS, if S𝒮 then ω𝒩(e)ψ𝒩(e,Z)=ωF(e)χF(e,Z)+ω𝒩(e)ψ𝒩(eS,𝒮) and, otherwise, ω𝒩(e)ψ𝒩(e,Z)=ω𝒩(e)ψ𝒩(eS,𝒮).

Claim 4.8.

For a leaf-arc e=vx with x on side S, it holds that ω𝒩(e)ψ𝒩(e,Z)=ωF(e)χF(e,Z) if S𝒮, and ψ𝒩(e,Z)=0, otherwise.

Due to space constraints, ˜4.7 and 4.8 are proven in the appendix.

Recall that, for any lowest path-arc eS on side S, it holds that ψ𝒩(eS,Z)=ψ𝒩(eS,𝒮) by definition. Let 𝒜𝒮(𝒩) denote the arc sides on 𝒩 and let 𝒮(𝒩) denote the reticulation sides. Note that 𝒮(𝒩)=𝒜𝒮(𝒩)𝒮(𝒩). By construction it is D=S𝒜𝒮(𝒩)ω𝒩(eS)ψ(eS,𝒮).

To prove the main claim, let E1 denote the leaf-arcs of 𝒩, let E2 denote the path-arcs which are not the lowest path-arc on their side, and let E3 denote the arcs which are the lowest path-arc of their side. For each i{1,2,3}, let Ei1 denote the arcs in Ei belonging to a side in 𝒮, and let Ei0=EiEi1. Note that the arc set of F is E11E21. For each arc eE2E3, let eS denote the lowest path-arc on the side containing e (note that if eE3 then eS=e). Due to space constraints, APD𝒩(Z)=PDF(Z)+D is proven in the appendix.

Finally, we show that Max-PD can be solved efficiently on forests. For this, we use Faith’s famous greedy algorithm on trees [8, 21, 25].

Lemma 4.9.

Max-PD can be solved in 𝒪(nk) time on forests.

Due to space constraints, Lemma 4.9 is proven in the appendix.

4.4 Putting everything together

For any signature 𝒮𝒮(𝒩), let (F𝒮,D𝒮) denote the pair (F,D) derived for 𝒮 according to Construction 1. Lemma 4.6 implies that APD𝒩(Z)=PDF𝒮(Z)+D𝒮, for any ZX with signature 𝒮. As every ZX has a signature, it follows that (𝒩,k,D) is a Yes-instance of Max-APD if and only if (F𝒮,k,DD𝒮) is a Yes-instance of Max-PD for some 𝒮𝒮(𝒩). We can therefore solve Max-APD on (𝒩,k,D) using the following algorithm: For each 𝒮𝒮(𝒩) in turn, construct (F𝒮,D𝒮) and solve Max-PD on (F𝒮,k,DD𝒮). If the answer is Yes for any Max-PD instance, return Yes. Otherwise, return No.

As the generator of 𝒩 has at most 5r2 sides [9], there are at most 25r2 signatures to consider. For each signature 𝒮 and each lowest path-arc eS, the value of ψ𝒩(eS,𝒮) can be computed in 𝒪(2rr) time by Lemma 4.5. After these values are computed, (F𝒮,D𝒮) can be constructed in 𝒪(m)=𝒪(n+r) time. Finally, solving Max-PD on (F𝒮,k,DD𝒮) can be done in 𝒪(nk) time by Lemma 4.9. Thus, the total running time of the algorithm is in 𝒪(25r2(2rr+n+r+nk))𝒪(26r2nk).

5 Experiments

5.1 Data

Phylogenetic networks have been simulated333All data is publicly available at https://sdrive.cnrs.fr/s/aEFNN3iRp7goQBT. via the package SiPhyNetwork [17]. This package allows simulations of phylogenetic networks as well as traits developing along the networks, giving several customization options. In the following, we describe the choices made for our simulations, and we refer to the SiPhyNetwork paper for more details [17]. We simulated 5324 phylogenetic networks with 1, 20, …, 100, 150, … 500 leaves using the generalized sampling approach [10] implemented in SiPhyNetwork, with 100 simulation replicates, speciation rate equal to 1, extinction rate equal to 0.6, and hybridization rate varying from 0.1 to 0.001 for getting a reasonable reticulation number, even for large instances (see distributions of instances in Figure 6).

In our simulations, we aimed at simulating hybridizations between plants and, for doing so, we opted for the “lineage generative hybridizations” implemented in SiPhyNetwork, with inheritance probability fixed to 1/2 and a trait model where hybridization occurs only among species with equal ploidy. The rate of autopolyploidy has been fixed to 0.05. In this simulation setting, the number of leaves and the number of reticulations are correlated, so it is unlikely to produce large networks with small reticulation number or small networks with large reticulation number.

Figure 6: Distribution of instances for the number n of leaves, and the reticulation number r.

5.2 Implementation Details

The algorithm presented in Section 4 was implemented in C++ in a serial manner (no multithreading) in the framework of the phylo-tools library [30] (both published under the open source CeCILL Free Software License 2.1) with a few heuristic improvements that we discuss in the following.

Guesses on the Generator.

Instead of “guessing” which of the 𝒪(r) sides of the generator have tree-paths to selected leaves as stated in Construction 1, it is sufficient to guess which nodes of the generator have tree-paths to selected leaves. The nodes are pre-filtered, so the guess is done only on nodes that actually have tree-paths to leaves in the input network. Then, all ik(𝒪(r)i) subsets of size at most k of such nodes are enumerated. Note that those are polynomially many if k is constant (with the degree depending on k).

Enumerating Switchings.

Given the information which generator nodes have tree-paths to selected leaves, we can group the possible switchings in order to avoid enumerating all 2r of them. Indeed, if Z contains no leaf below some reticulation v, then the switching of its reticulation arcs has no bearing on the value of χ(e,Z,σ) for any e. This allows us to restrict enumeration to switchings involving reticulations above leaves in Z. If the input network is mostly “flat”, that is, most paths contain only constantly many reticulations, then we expect the enumeration-step to take 𝒪((rk)), rather than 𝒪(2r) steps. For fixed k, we would therefore expect polynomial running time (with the degree depending on k).

Keeping Multiple Solutions.

In order to assess the quality of our diversity measure and compare it to other phylogenetic diversity measures, as well as the actual feature diversity, it would be helpful to output not only one, but all best-scoring solutions, and maybe even some sub-optimal solutions. To accommodate this, we store solutions in a sorted array that retains a given number of solutions and automatically discards solutions with a lower score.

Implementation of Greedy Selection.

Once the values ψ𝒩(e,Z) are computed for all e, we set up a structure mapping each node u of the network to a list of its children v sorted by diversity attainable by saving a leaf with a tree-path from v that avoids generator nodes. Then, the best entry is selected, the pointers are followed down to the best leaf , and the structure is updated from  upwards until a generator node is encountered. While we incur a factor of 𝒪(logΔ𝒩) (where Δ𝒩 is the maximum out-degree in the network), this degenerates to 𝒪(1) since our input networks are binary. Note that we avoid explicitly constructing the forest on which we run the greedy algorithm.

5.3 Experimental Results

While the theoretical running time of 𝒪(26r2nk) may not seem convincing, the implementation performs satisfactorily in practice, possibly due to the heuristic improvements discussed in Section 5.2.

Figure 7: Plots of the observed running time in dependence of the reticulation number r (with k=5 fixed), and of the solution size k (with n=60 and r=12 fixed).

Reticulation number 𝒓.

Figure 7 (left) shows that the observed dependence on the reticulation number r for k=5 behaves very similarly to the reference function x5/500. For k=5, this seems to confirm our hypothesis that the running time can be bounded in 𝒪(rk) on most networks. While the real-time setting (up to 17ms per run) ends rather early at r12, we can cover up to r55 in under 5 minutes, even for networks with many leaves (n=500).

Solution size 𝒌.

Figure 7 (right) shows a clear partition of growth scenarios for the running time dependent on k with n=60 and r=12 fixed: For kr, the running time grows exponentially, roughly proportional to 4kr0.56k but for kr, its growth behaves as a linear function in k. Again, this is characteristic for functions in 𝒪(min{rk,26r}nk).

Each of the 18 instances with n=60, r=12 finished within 8 seconds for each k[1,40].

6 Discussion

Multiple formulations of the concept of phylogenetic diversity on phylogenetic networks have been proposed recently, and it is not yet clear which variants are most relevant in practice, for example for conservation biology. In our opinion, the average-tree phylogenetic diversity score (APD) appears to be a very natural and biologically relevant generalization of phylogenetic diversity on trees. In addition, in the small toy example in Figure 1, APD seems to make a more reasonable choice than NetworkPD by selecting the hybrid species. However, to properly compare the different variants of phylogenetic diversity and to compare them to the feature diversity [32], extensive simulations and multiple case studies are needed. We believe that this is a very important topic for further research.

From a computational point of view, an interesting open problem that remains is whether our FPT algorithm for the maximization version of APD can be improved to a parameterized algorithm with respect to the level of the network. This would be interesting since the level can be much smaller than the current parameter, the number of reticulations. We conjecture that, in contrast to NetworkPD (which is NP-hard for level-1 networks [27]), this is possible for APD. Another interesting direction would be to extend the algorithm to nonbinary networks.

Finally, while the theoretical running time of our algorithm is 𝒪(26rkn), experiments suggest that the practical running time behaves like 𝒪(min{rk,4r}kn) on networks generated by evolutionary processes. It could be interesting to study this further.

References

  • [1] Eric Bapteste, Leo van Iersel, Axel Janke, Scot Kelchner, Steven Kelk, James O. McInerney, David A. Morrison, Luay Nakhleh, Mike Steel, Leen Stougie, and James Whitfield. Networks: expanding evolutionary thinking. Trends in Genetics, 29(8):439–441, 2013.
  • [2] Magnus Bordewich, Charles Semple, and Andreas Spillner. Optimizing phylogenetic diversity across two trees. Applied Mathematics Letters, 22(5):638–641, 2009. doi:10.1016/j.aml.2008.05.004.
  • [3] Magnus Bordewich, Charles Semple, and Kristina Wicke. On the complexity of optimising variants of phylogenetic diversity on phylogenetic networks. Theoretical Computer Science, 917:66–80, 2022. doi:10.1016/J.TCS.2022.03.012.
  • [4] Olga Chernomor, Steffen Klaere, Arndt von Haeseler, and Bui Quang Minh. Split Diversity: Measuring and Optimizing Biodiversity Using Phylogenetic Split Networks, volume 14, pages 173–195. Springer Cham, 2016.
  • [5] Tomás M. Coronado, Gabriel Riera, and Francesc Rosselló. An interchange property for the rooted phylogenetic subnet diversity on phylogenetic networks. Journal of Mathematical Biology, 89(5):48, 2024.
  • [6] Paul Dagum and Michael Luby. Approximating the permanent of graphs with large factors. Theoretical Computer Science, 102(2):283–305, 1992. doi:10.1016/0304-3975(92)90234-7.
  • [7] W. Ford Doolittle. Phylogenetic Classification and the Universal Tree. Science, 284(5423):2124–2128, 1999. doi:10.1126/science.284.5423.2124.
  • [8] Daniel P. Faith. Conservation evaluation and phylogenetic diversity. Biological Conservation, 61(1):1–10, 1992.
  • [9] Philippe Gambette, Vincent Berry, and Christophe Paul. The Structure of Level-k Phylogenetic Networks. In Proceedings of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM 2009), pages 289–300. Springer, 2009. doi:10.1007/978-3-642-02441-2_26.
  • [10] Klaas Hartmann, Dennis Wong, and Tanja Stadler. Sampling Trees from Evolutionary Models. Systematic Biology, 59(4):465–476, 2010.
  • [11] Niels Holtgrefe, Leo van Iersel, Ruben Meuwese, Yuki Murakami, and Jannik Schestag. PANDA: Maximizing Phylogenetic Diversity in Network. Manuscript in preparation, 2025.
  • [12] Daniel H. Huson, Regula Rupp, and Celine Scornavacca. Phylogenetic networks: Concepts, Algorithms and Applications. Cambridge University Press, 2010.
  • [13] Laura Jetten and Leo van Iersel. Nonbinary Tree-Based Phylogenetic Networks. IEEE/ACM transactions on computational biology and bioinformatics, 15(1):205–217, 2016.
  • [14] Mark Jones and Jannik Schestag. How Can We Maximize Phylogenetic Diversity? Parameterized Approaches for Networks. In Proceedings of the 18th International Symposium on Parameterized and Exact Computation (IPEC 2023), pages 30:1–30:12. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.IPEC.2023.30.
  • [15] Mark Jones and Jannik Schestag. Maximizing Phylogenetic Diversity under Time Pressure: Planning with Extinctions Ahead. arXiv preprint arXiv:2403.14217, 2024. doi:10.48550/arXiv.2403.14217.
  • [16] Mark Jones and Jannik Schestag. Parameterized Algorithms for Diversity of Networks with Ecological Dependencies. Manuscript in preparation, 2025.
  • [17] Joshua A. Justison, Claudia Solis-Lemus, and Tracy A. Heath. SiPhyNetwork: An R package for simulating phylogenetic networks. Methods in Ecology and Evolution, 14(7):1687–1698, 2023.
  • [18] Elizabeth Kolbert. The sixth extinction: an unnatural history. Henry Holt and Company, New York, 2014.
  • [19] Christian Komusiewicz and Jannik Schestag. Maximizing Phylogenetic Diversity under Ecological Constraints: A Parameterized Complexity Study. In Proceedings of the 44th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2024). Schloss-Dagstuhl-Leibniz Zentrum für Informatik, 2024.
  • [20] Binh T. Nguyen, Andreas Spillner, and Vincent Moulton. Computing Phylogenetic Diversity for Split Systems. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 5(02):235–244, 2008. doi:10.1109/TCBB.2007.70260.
  • [21] Fabio Pardi and Nick Goldman. Species Choice for Comparative Genomics: Being Greedy Works. PLoS Genetics, 1(6):e71, 2005. doi:10.1371/journal.pgen.0010071.
  • [22] Roseli Pellens and Philippe Grandcolas. Biodiversity Conservation and Phylogenetic Systematics: preserving our evolutionary heritage in an extinction crisis. Springer Nature, 2016.
  • [23] Jannik Schestag. Weighted Food Webs Make Computing Phylogenetic Diversity So Much Harder. Submitted for publication, 2025.
  • [24] Jannik Schestag. Who Should Have a Place on the Ark? Parameterized Algorithms for the Maximization of the Phylogenetic Diversity. Doctoral dissertation, Friedrich Schiller University Jena, 2025. URL: https://www.fmi.uni-jena.de/fmi_femedia/33737/dissertation-schestag-25-pdf.pdf.
  • [25] Mike Steel. Phylogenetic Diversity and the Greedy Algorithm. Systematic Biology, 54(4):527–529, 2005.
  • [26] Scott A. Taylor and Erica L. Larson. Insights from genomes into the evolutionary importance and prevalence of hybridization in nature. nature ecology & evolution, 3(2):170–177, 2019.
  • [27] Leo van Iersel, Mark Jones, Jannik Schestag, Celine Scornavacca, and Mathias Weller. Phylogenetic Network Diversity Parameterized by Reticulation Number and Beyond. In Proceedings of the 23rd RECOMB International Workshop on Comparative Genomics (RECOMB-CG 2025), Seoul, Republic of Korea, 2025.
  • [28] Leo van Iersel, Judith Keijsper, Steven Kelk, Leen Stougie, Ferry Hagen, and Teun Boekhout. Constructing Level-2 Phylogenetic Networks from Triplets. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(4):667–681, 2009. doi:10.1109/TCBB.2009.22.
  • [29] Leo van Iersel, Sjors Kole, Vincent Moulton, and Leonie Nipius. An algorithm for reconstructing level-2 phylogenetic networks from trinets. Information Processing Letters, 178:106300, 2022. doi:10.1016/j.ipl.2022.106300.
  • [30] Mathias Weller. phylo_tools (c++20 branch). https://github.com/igel-kun/phylo_tools/tree/C%2B%2B20, 2025. Accessed: 2025-05-10.
  • [31] Kristina Wicke and Mareike Fischer. Phylogenetic diversity and biodiversity indices on phylogenetic networks. Mathematical Biosciences, 298:80–90, 2018.
  • [32] Kristina Wicke, Arne Mooers, and Mike Steel. Formal Links between Feature Diversity and Phylogenetic Diversity. Systematic Biology, 70(3):480–490, 2021.

Appendix A Appendix

A.1 Proof of Lemma 4.6

Even though only parts of the proof of Lemma 4.6 have been deferred to the appendix, we present the entire proof here.

Lemma A.1.

Let 𝒮𝒮(𝒩) be a subset of sides and let (D,F) be derived from (𝒩,𝒮) according to Construction 1. Then for ZX, the signature of Z is SigZ=𝒮 if and only if Z respects F. Moreover, if SigZ=𝒮, then APD𝒩(Z)=PDF(Z)+D.

Proof.

Observe that by construction, each tree in F corresponds to a side S𝒮, and as such Z respects F if and only if SigZ=𝒮.

So now assume that SigZ=𝒮. We prove that APD𝒩(Z)=PDF(Z)+D.

Note that every arc in F has a corresponding arc in 𝒩. For ease of notation, we speak of the same arc e as existing in F and 𝒩. Let ω𝒩(e) denote the weight of an arc e in 𝒩, and let ωF denote the weight of e in F. For any arc e that is not a lowest path-arc in 𝒩, let χF(e,Z):=1 if e has an offspring from Z in F, and let χF(e,Z):=0 otherwise. Observe that PDF(Z)=eA(F)ωF(e)χF(e,Z).

Recall that ψ𝒩(e,Z) denotes the probability that, taken some switching σ, the arc e has an offspring from Z in Tσ. By Lemma 4.3 we know APD𝒩(Z)=eA(𝒩)ω𝒩(e)ψ𝒩(e,Z).

Claim A.2.

For an arc side S with lowest path-arc eS, and e a path-arc of S that is not eS, if S𝒮 then ω𝒩(e)ψ𝒩(e,Z)=ωF(e)χF(e,Z)+ω𝒩(e)ψ𝒩(eS,𝒮) and, otherwise, ω𝒩(e)ψ𝒩(e,Z)=ω𝒩(e)ψ𝒩(eS,𝒮).

Proof.

First suppose that S𝒮 and χF(e,Z)=1. Then e has an offspring zZ which is a leaf of side S. As there are no reticulations on the path between e and z in 𝒩, it holds that ψ𝒩(e,Z)=1 and so ω𝒩(e)ψ𝒩(e,Z)=ω𝒩(e). On the other hand,

ωF(e)χF(e,Z)+ω𝒩(e)ψ𝒩(eS,𝒮) =ωF(e)+ω𝒩(e)ψ𝒩(eS,𝒮)
=(1ψ𝒩(eS,𝒮))ω𝒩(e)+ω𝒩(e)ψ𝒩(eS,𝒮)
=ω𝒩(e)

and so the claim holds.

Next suppose that χF(e,Z)=0. Then e has no offspring from side S in 𝒩. It follows that e has an offspring from Z in Tσ if and only if eS has an offspring from Z in Tσ, for any switching σ. Thus ψ𝒩(e,Z)=ψ𝒩(eS,Z)=ψ𝒩(eS,𝒮). But then ωF(e)χF(e,Z)+ω𝒩(e)ψ𝒩(eS,𝒮)=ω𝒩(e)ψ𝒩(eS,𝒮)=ω𝒩(e)ψ𝒩(e,Z), and so the claim holds. A similar argument holds for the case that S𝒮.

Claim A.3.

For a leaf-arc e=vx with x on side S, it holds that ω𝒩(e)ψ𝒩(e,Z)=ωF(e)χF(e,Z) if S𝒮, and ψ𝒩(e,Z)=0, otherwise.

Proof.

Observe that ψ𝒩(e,Z)=1 if and only if xZ, and ψ𝒩(e,Z)=0 otherwise. Since Z has signature 𝒮, it follows that ψ𝒩(e,Z)=0 if S𝒮. On the other hand if S𝒮, then χF(e,Z)=1 if and only if xZ and so ψ𝒩(e,Z)=χF(e,Z). The claim then follows from the fact that ω𝒩(e)=ωF(e) when e is a leaf-arc.

Recall that, for any lowest path-arc eS on side S, it holds that ψ𝒩(eS,Z)=ψ𝒩(eS,𝒮) by definition. Let 𝒜𝒮(𝒩) denote the arc sides on 𝒩 and let 𝒮(𝒩) denote the reticulation sides. Note that 𝒮(𝒩)=𝒜𝒮(𝒩)𝒮(𝒩). By construction it is D=S𝒜𝒮(𝒩)ω𝒩(eS)ψ(eS,𝒮).

To prove the main claim, let E1 denote the leaf-arcs of 𝒩, let E2 denote the path-arcs which are not the lowest path-arc on their side, and let E3 denote the arcs which are the lowest path-arc of their side. For each i{1,2,3}, let Ei1 denote the arcs in Ei belonging to a side in 𝒮, and let Ei0=EiEi1. Note that the arc set of F is E11E21. For each arc eE2E3, let eS denote the lowest path-arc on the side containing e (note that if eE3 then eS=e).

APD𝒩(Z)= eA(𝒩)ω𝒩(e)ψ𝒩(e,Z)
= eE11E10E21E20E3ω𝒩(e)ψ𝒩(e,Z)
= eE11ωF(e)χF(e,Z)+0+E21E20E3ω𝒩(e)ψ𝒩(e,Z)
= eE11ωF(e)χF(e,Z)+eE21(ωF(e)χF(e,Z)+ω𝒩ψ𝒩(eS,𝒮))
+E20E3ω𝒩(e)ψ𝒩(e,Z)
= eE11ωF(e)χF(e,Z)+eE21ωF(e)χF(e,Z)+eE21ω𝒩ψ𝒩(eS,𝒮)
+E20ω𝒩(e)ψ𝒩(eS,𝒮)+E3ω𝒩(e)ψ𝒩(eS,𝒮)
= eE11E21ωF(e)χF(e,Z)+eE21E20E3ω𝒩ψ𝒩(eS,𝒮)
= eA(F)ωF(e)χF(e,Z)+eE21E20E3ω𝒩ψ𝒩(eS,𝒮)=PDF(Z)+D

A.2 Proof of Lemma 4.9

Lemma A.4.

Max-PD can be solved in 𝒪(nk) time on forests.

Proof.

Let :=(F,k,D) be an instance of Max-PD with forest F={T1,,Ts} where ρi is the root of Ti. Let M be an integer bigger than ω(A(F)).

We define a tree T with root ρ, as the union of T1,,Ts with additional arcs ρρi of weight M for each i{1,,s}. Solve the instance :=(T,k,D+sM) of Max-PD with Faith’s greedy algorithm and return the result.

Let ZX be a solution for . Because Z respects F, there is a taxon xiZ beneath ρi. Consequently, ρρi is on the path from ρ to xi and so PDT(Z)=sM+PDF(Z)sM+D.

Conversely, let ZX be a solution for . As PDTD+sM, we conclude that Z contains a taxon of each tree and therefore respects F. Therefore, Z is a solution for .

is constructed in linear time. On trees, Max-PD is solved in 𝒪(nk) time [25, 21].