Which Phylogenetic Networks Are Level- Networks with Additional Arcs? Structure and Algorithms
Abstract
Reticulate evolution gives rise to complex phylogenetic networks, making their interpretation challenging. A typical approach is to extract trees within such networks. Since Francis and Steel’s seminal paper, “Which Phylogenetic Networks are Merely Trees with Additional Arcs?” (2015), tree-based phylogenetic networks and their support trees (spanning trees with the same root and leaf-set as a given network) have been extensively studied. However, not all phylogenetic networks are tree-based, and for the study of reticulate evolution, it is often more biologically relevant to identify support networks rather than trees. This study generalizes Hayamizu’s structure theorem, which yielded optimal algorithms for various computational problems on support trees of rooted almost-binary phylogenetic networks, to extend the theoretical framework for support trees to support networks. This allows us to obtain a direct-product characterization of each of three sets: all, minimal, and minimum support networks, for a given network. Each characterization yields optimal algorithms for counting and generating the support networks of each type. Applications include a linear-time algorithm for finding a support network with the fewest reticulations (i.e., the minimum tier). We also provide exact and heuristic algorithms for finding a support network with the minimum level, both running in exponential time but practical across a reasonably wide range of reticulation numbers.
Keywords and phrases:
Phylogenetic networks, Support networks, Level- networks, Tier- networks, Structure theorem, Enumeration, OptimizationFunding:
Takatora Suzuki: JST SPRING Program Grant Number JPMJSP2128, Japan.Copyright and License:
2012 ACM Subject Classification:
Applied computing Biological networksSupplementary Material:
Software (Source Code, Datasets and Detailed Experimental Results): https://github.com/hayamizu-lab/structure-support-networksarchived at
swh:1:dir:a40004b85200c9d46525feb46dc740a1f83ab3eb
Acknowledgements:
The authors thank Tsuyoshi Urata, Haruki Miyaji, and Yukihiro Murakami for their discussions, Manato Yokoyama for help with the GitHub repository, and anonymous reviewers for their careful reading and useful comments.Editors:
Broňa Brejová and Rob PatroSeries and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Evolutionary histories involving reticulate events, such as horizontal gene transfer, hybridization, and recombination, are better represented by networks than trees. However, networks can be far more complex than trees, making their interpretation challenging. A typical approach is to find meaningful subgraphs, particularly spanning trees, of these networks.
Since Francis and Steel’s seminal paper “Which Phylogenetic Networks are Merely Trees with Additional Arcs?” [8], tree-based networks and their support trees (spanning trees with the same root and leaf-set as the network, also known as subdivision trees) have been extensively studied (e.g., [9, 10, 16]). Hayamizu [9] developed optimal (linear-time or linear-delay) algorithms for various computational problems, including counting, listing, and optimization of support trees, by establishing a theoretical foundation called a structure theorem for rooted almost-binary phylogenetic networks, a class encompassing binary ones. This theorem provides a canonical way to decompose any such network into its unique maximal zig-zag trails, and characterizes the family of edge-sets of support trees of the network.
While tree-based phylogenetic networks encompass many well-studied subclasses such as tree-child [1], stack-free [14], and orchard networks [4, 12], networks inferred from biological data are not necessarily tree-based. This has led to increased interest in non-tree-based networks (e.g., [2, 6, 7, 15]). Indeed, for the study of reticulate evolution, it is often more biologically relevant to consider networks within networks rather than trees. In such contexts, finding a most concise subgraph, such as one with the fewest reticulations (minimum tier) or with the minimum level, is particularly meaningful. These metrics can be interpreted as measures of deviation from being tree-based [7, 6], as they equal zero for tree-based networks.
In this paper, we generalize Hayamizu’s structure theorem to build a theoretical framework for support networks of rooted almost-binary phylogenetic networks. Building on the maximal zig-zag trail decomposition established in [9], we derive direct-product characterizations of three families of support networks for a given network : all support networks , minimal support networks , and minimum support networks . These characterizations yield closed-form product formulas for their cardinalities, revealing unexpected connections to Fibonacci, Lucas, Padovan, and Perrin numbers. We present a linear-time algorithm for counting each of , , and , as well as a linear-delay algorithm to list the support networks in each family.
Our theoretical results lead to practical algorithms for two key optimization problems: Problem 1 (Reticulation Minimization) and Problem 2 (Level Minimization). Problem 1 asks for a support network with the minimum reticulation number (minimum tier). We present a linear-time optimal algorithm for solving Problem 1, which can be achieved by selecting any element from . Problem 2 asks for a support network with the minimum level. We conjecture that this problem is NP-hard and present exact and heuristic algorithms for solving Problem 2 (Algorithms 1 and 2). Although they are both exponential-time algorithms, they are practical for networks with a reasonably wide range of reticulation numbers. The heuristic is particularly scalable and accurate in most cases.
The remainder of the paper is organized as follows. In Section 2, we define graph theoretical terminology and review the relevant materials on support trees (Section 2.1) and the structure theorem (Section 2.2). In Section 3, we define support networks, their families , , and , and relevant concepts. Section 4 provides characterizations and counting formulas for each family. Numerical results are also presented in Section 4.4. In Section 5, we give a linear-time algorithm for solving Problem 1. Section 6 presents the exact and heuristic algorithms for Problem 2, along with performance evaluations using synthetic data. Section 7 concludes with a summary of our contributions and directions for future research.
2 Preliminaries
The graphs in this paper are finite, simple (i.e. having neither loops nor multiple edges), acyclic directed graphs, unless otherwise stated. For a graph , and denote the sets of vertices and edges of , respectively. For two graphs and , is a subgraph of if both and hold, in which case we write . Two graphs and are isomorphic, denoted by , if there exists a bijection such that if and only if for all . A subgraph of is proper if . A subgraph of is a spanning subgraph of if .
Given a graph and a non-empty subset , the edge-set is said to induce the subgraph of , that is, the one whose edge-set is and whose vertex-set is the set of the ends of all edges in . For a graph with and a partition of , the collection is a decomposition of , and is said to be decomposed into . Here we recall that a partition of a set is a collection of pairwise disjoint non-empty subsets whose union is the entire set.
For an edge of a graph , and are denoted by and , respectively. For a vertex of a graph , the in-degree of in , denoted by , is the cardinality of . The out-degree of in , denoted by , is defined in a similar way. For any graph , a vertex with is called a leaf of . Subdividing an edge means replacing it with a directed path from to of length at least two. Smoothing a vertex where means suppressing from , namely, the reverse operation of edge subdivision.
An undirected graph is connected if there is a path between every pair of vertices. For a connected simple undirected graph , a cut vertex (resp. cut edge) of is a vertex (resp. edge) whose removal disconnects , and a block of is a maximal connected subgraph of that contains no cut vertex. In this paper, a block of a directed graph refers to a block of its underlying undirected graph.
2.1 Phylogenetic networks and support trees
Suppose represents a non-empty finite set of present-day species. A rooted almost-binary phylogenetic network (on a leaf-set ) is defined to be a finite simple directed acyclic graph with the following properties (P1)–(P3). The vertex is called the root of , and any vertex with is called a reticulation of .
- (P1)
-
There exists a unique vertex of with and ;
- (P2)
-
The set of leaves of is identical to ;
- (P3)
-
For any , and ;
The above definition allows to contain a vertex with or . In this sense, it slightly generalizes the notion of rooted binary phylogenetic networks on , which is defined by (P1), (P2) and (P4). When has the properties (P1), (P2) and (P5), is particularly called a rooted binary phylogenetic tree on .
- (P4)
-
For any , .
- (P5)
-
For any , .
While the concept of support trees was originally defined for rooted binary phylogenetic networks in [8], the theoretical framework developed in [9], which includes the structure theorem and associated algorithms and forms the foundation of the present work, applies equally to rooted almost-binary networks. Therefore, in this paper, following the approach taken in [15], we will consider support trees of almost-binary (including binary) networks, unless stated otherwise. For brevity, we omit “rooted almost-binary” when no confusion is likely to arise.
Definition 1.
Let be a rooted almost-binary phylogenetic network on . If there exists a rooted binary phylogenetic tree on and a spanning tree of such that is obtained by smoothing all with , then is called a tree-based network on . In this case, is called a support tree of and a base tree of .
As a support tree is a spanning tree of , each is specified by its edge-set. This leads to the natural question: Which subsets of yield a support tree of ? Francis and Steel [8] proved that such “admissible” subsets of are characterized by the following conditions (C1)–(C3). As in [9], we slightly generalize the original definition in [8] so that we can consider admissible edge selections for any subgraph of .
Definition 2.
Let be a rooted almost-binary phylogenetic network and let be any subgraph of . A subset of is admissible if it satisfies the following conditions:
- (C1)
-
If is an edge of with or , then contains .
- (C2)
-
If and are distinct edges of with , then contains at least one of .
- (C3)
-
If and are distinct edges of with , then contains exactly one of .
Theorem 3 (almost-binary version of Theorem 1(a) in [8]).
Let be a rooted almost-binary phylogenetic network and let . Then, the subgraph of induced by is a support tree of if and only if is an admissible subset of . Moreover, there exists a one-to-one correspondence between the family of admissible subsets of and the family of support trees of .
Remark 4.
In the setting where the internal vertices of (those except the root and leaves) are unlabeled and thus indistinguishable from one another, two different admissible subsets of can induce isomorphic support trees . Due to this subtlety, multiple variations of the support tree counting problem have been explicitly formulated (see Fig. 1 and Section 7.2 in [9]). Therefore, a more precise interpretation of Theorem 3 may be that it establishes a one-to-one correspondence between the family of admissible subsets of and the family of support trees of , under the assumption that all vertices (or edges) of are distinguishable. In this paper, we adopt this assumption to obtain generalizations of Theorem 3.
2.2 Structure theorem for rooted almost-binary phylogenetic networks
We recall the relevant materials from [9, 15]. Given a rooted almost-binary phylogenetic network , a connected subgraph of with is a zig-zag trail (in ) if the edges of can be permuted as such that either or holds for each . A zig-zag trail is represented by an alternating sequence of (not necessarily distinct) vertices and distinct edges, e.g., , but this can be more concisely written as or its reverse, where each edge is represented by . A zig-zag trail in is maximal if contains no zig-zag trail such that is a proper subgraph of . Each maximal zig-zag trail in falls into one of the following four types. A crown is a maximal zig-zag trail that has even and can be written in the cyclic form . A fence is any maximal zig-zag trail that is not a crown. An N-fence is a fence with odd , expressed as . A fence with even is an M-fence if expressed as , and it is a W-fence if expressed as .
As established in [9] (and will be restated in Theorem 8), any rooted almost-binary phylogenetic network admits a unique decomposition into its maximal zig-zag trails (), called the zig-zag trail decomposition of (see Figure 1 for an illustration). To understand how this decomposition arises, observe that every edge of belongs to an obvious zig-zag trail in . If has out-degree two in , or has in-degree two in , then this trail is not maximal. In such cases, by repeatedly extending the zig-zag trail, one can eventually obtain a maximal one. By construction, if each vertex of has in-degree and out-degree at most two, then a maximal zig-zag trail containing is unique for each . In other words, if is almost-binary, then no two distinct maximal zig-zag trails in have a common edge. It follows that is a partition of , and thus is indeed a decomposition of .
Remark 5.
If is a maximal zig-zag trail in , then, the following is equivalent to condition (C1) in Definition 2: If is an edge of with or , then contains . In fact, by the maximality of , we have if and only if holds. Similarly, and are equivalent.
Proposition 6 ([9]).
The maximal zig-zag trail decomposition of can be computed in time.
As in the approach taken in [9, 10], we often consider a maximal zig-zag trail as a sequence of edges, ordered according to their appearance in the trail, where and are called the terminal edges of when is a fence. For any maximal zig-zag trail in , any subset of is specified by a - sequence , where if and otherwise. For example, given a crown , a subset of can be encoded as . We often write to avoid a repetition such as .
For the reader’s benefit, we now illustrate how to find admissible subsets using small examples. Let be the maximal zig-zag trail decomposition of . Suppose is an N-fence with . Then, is the only admissible subset of because (C1) requires inclusion of both terminal edges and , (C3) excludes selecting both and , and (C2) then requires , which in turn excludes due to (C3). Thus, the family of admissible subsets of consists of a single element , and so . Next, suppose is a crown. Then, and are the only two admissible subsets of , because they are the only subsets that satisfy both (C2) and (C3), and crowns have no edge subject to (C1). Thus, .
Remark 7.
The choice of an admissible subset from is independent of and does not influence the selection concerning any other . Indeed, Remark 5 implies that condition (C1) merely requires both terminal edges of each fence in be always selected (i.e. be marked by ). In addition, (C2) and (C3) are relevant only to a fence or crown that has a pair of edges expressed as and , respectively. Recalling that the maximal zig-zag trails are edge-disjoint, whenever such a pair of edges exists, it is contained in a unique maximal zig-zag trail in . Thus, one can check whether is admissible or not even without looking at the entire network .
Remark 7 provides a key idea behind the direct product decomposition in Theorem 8. After obtaining for , one can construct an admissible subset of simply by computing , where is any admissible subset of . The family of support trees of – the family of admissible subsets of – by listing all such combinations with each .
Using the definitions, notation, and key ideas outlined above, we now state the structure theorem for rooted almost-binary phylogenetic networks. It provides a way to canonically decompose such networks and characterizes the family of admissible subsets of , namely, the family of edge-sets of support trees of .
Theorem 8 ([9]).
For any rooted almost-binary phylogenetic network , the following hold:
-
1.
The maximal zig-zag trail decomposition of is unique to .
-
2.
Given , is a support tree of if and only if for each , is an admissible subset of . Here, the family of admissible subsets of is given by Equation (1).
-
3.
The family of support trees of is non-empty (i.e., is tree-based) if and only if each is non-empty (i.e., no element of is a W-fence). When is non-empty, it is characterized by , where is an arbitrary ordering for .
| (1) |
3 The family of support networks and its two subfamilies
For a rooted (not-necessarily-binary) phylogenetic network , denotes the reticulation number of , i.e., the number of reticulations in , and denotes the level of , i.e., the maximum number of reticulations of contained in a block of , and the tier is . By definition, holds. In general, holds. If is almost-binary, then holds by the hand-shaking lemma for directed graphs, which states that the sum of the in-degrees over all vertices equals the number of edges gives. Since is assumed to be almost-binary in this paper, we will use the terms “tier” and “reticulation number” interchangeably.
Let be a rooted almost-binary phylogenetic network on and let be a spanning subgraph of . If is also a rooted almost-binary phylogenetic network on , then is a support network of . The base network of is obtained from by smoothing all vertices with . Unlike support and base trees, support and base networks always exist since itself is a support network of . The base tier of , denoted by , is the minimum value of , which equals , over all support networks of . Similarly, the base level of , denoted by , is the minimum value of , which equals , over all support networks of . If is the base tier (resp. base level) of , then is tier--based (resp. level--based).
For unrooted phylogenetic networks, Fischer and Francis [6] has introduced the concept of support networks to discuss tier--based and level--based networks. While the definitions are analogous, the mathematical properties of these concepts differ between rooted and unrooted networks. Indeed, for unrooted , it was shown in [6] that holds if has at most one “non-trivial blob” – a maximal connected subgraph without a cut edge and with at least two vertices. For rooted , however, this equality does not hold in general (see Figure 2 for a counterexample).
The focus of this paper is on the computation of and for rooted , with particular emphasis on the latter which presents greater computational challenges. We approach these problems using the families of minimal and minimum support networks of , which are defined as follows:
Definition 9.
For a rooted almost-binary phylogenetic network ,
-
is the set of all support networks of ;
-
is the set of minimal support networks of (i.e. those with a minimal edge-set);
-
is the set of minimum support networks of (i.e. those with the fewest edges).
Definition 9 implies . As Figure 3 shows, these families can be all distinct. We also note that if is tree-based, then is nothing but the set of support trees of characterized in Theorems 3 and 8.
4 Counting the support networks of three types
Examining the proper subfamilies and rather than the entire family of support networks could reduce the search space for computing or . A natural question is how substantial this reduction might be. In this section, we generalize Theorem 8 to derive analogous direct-product characterizations of , , and , and determine their cardinalities. To achieve this, we relax condition (C3) in Definition 2 to introduce different notions of admissibility as follows.
Definition 10.
Let be a rooted almost-binary phylogenetic network and let be any subgraph of . A subset of is -admissible if it satisfies the following conditions:
- (C1⋆)
-
If is an edge of with or , then contains ;
- (C2⋆)
-
If and are distinct edges of with or , then contains at least one of .
In particular, given an -admissible subset of , is -admissible if it is minimal, i.e. no proper subset of is an -admissible subset of , and is -admissible if it is smallest among all -admissible subsets of .
The definition of -admissibility differs from the previous admissibility (Definition 2) in a single aspect: it relaxes the original condition (C3). In fact, (C1⋆) is identical to the original condition (C1). Similarly to Remark 5, when is a maximal zig-zag trail in , we may replace and in condition (C1⋆) with and , respectively. The new condition (C2⋆) is an amalgamation of the original condition (C2) and a relaxed version of (C3). In other words, (C2⋆) retains the original requirement for edges sharing the same tail, while it replaces the “exactly one” requirement for edges with the same head by a weaker “at least one”. This modification of (C3) allows an , , and -admissible subsets of to contain both incoming edges at a reticulation vertex and to induce a non-tree subgraph of with a reticulation.
Lemma 11 generalizes Theorem 3 and the second statement of Theorem 8. The proof is given in Appendix A.1.
Lemma 11.
Let be a rooted almost-binary phylogenetic network, and let . Then, there is a one-to-one correspondence between the family of support networks and the family of -admissible subsets of . Moreover, the subgraph of induced by is a support network in if and only if is an -admissible subset of for each maximal zig-zag trail in .
4.1 The number of all support networks
By Lemma 11, one can find an -admissible subset of in essentially the same manner as for an admissible subset of : first, compute the zig-zag trail decomposition of , and then select edges of according to the conditions (C1⋆) and (C2⋆) in Definition 10. As explained in Remark 7, when is a fence, (C1⋆) requires that the terminal edges and be marked . In contrast, new condition (C2⋆) prohibits unselected consecutive edges of – consecutive ’s in the - sequence – regardless of whether is a crown or a fence.
Thus, the family of -admissible subsets of is expressed by (3) for each . We note that, if is a crown, different choices of result in different sequences representing the same subset of . For example, both and represent the same subset that does not satisfy condition (C2⋆).
| (3) |
Moreover, by Lemma 11, is characterized by , similarly to in Theorem 8. Therefore, . As in (3), each number depends on whether is a fence or not. This number can be more explicitly represented using Fibonacci and Lucas numbers (OEIS A000045 and OEIS A000032, respectively) as follows. The proof of Theorem 12 is given in Appendix A.2.
Theorem 12.
Let be a rooted almost-binary phylogenetic network and be the maximal zig-zag trail decomposition of . Let be the Fibonacci sequence defined by and (), and be the Lucas sequence defined by and (). Then, the number of support networks is given by (4).
| (4) |
Moreover, holds, where is the golden ratio.
Theorem 12 yields an obvious algorithm for counting . It first computes the maximal zig-zag trail decomposition of , and then calculates using (4).
Proposition 13.
can be computed by in time.
Proof.
4.2 The number of minimal support networks
We can see that -admissible subset of is -admissible if and only if contains no three consecutive edges in the edge sequence of . Thus, the family of -admissible subsets of is expressed as in (5) for each .
| (5) |
By Lemma 11, is characterized by . Therefore, . This number is expressed using Padovan numbers (OEIS A000931) and Perrin numbers (OEIS A001608) as follows. The proof of Theorem 14 is in Appendix A.3.
Theorem 14.
Let be a rooted almost-binary phylogenetic network and be the maximal zig-zag trail decomposition of . Let be the Padovan sequence defined by and (), and be the Perrin sequence defined by and (). Then, the number of minimal support networks is given by (6).
| (6) |
Moreover, holds, where is the plastic number.
Proposition 15.
can be computed in time.
Proof.
The only difference between (4) and (6) is that and are replaced by and , respectively. For each , both and can also be computed in time. The remainder is the same as the proof of Proposition 13.
As noted in Section 4.1, the elements of can also be listed with delay.
4.3 The number of minimum support networks
The construction of -admissible subsets of each is the same as that of admissible subsets for support trees, with the only difference that -admissible subsets allow W-fences.
| (7) |
By Lemma 11, is characterized by . Therefore, . From (7), we have for any crown , for any N-fence , and for any other . Thus, we obtain Theorem 16 and Proposition 17.
Theorem 16.
Let be a rooted almost-binary phylogenetic network and be the maximal zig-zag trail decomposition of . Then, holds, where is the number of crowns in , and .
Proposition 17.
can be computed in time.
As noted in Section 4.1, the elements of can also be listed with delay.
4.4 Case study: Comparison of the numbers , , and
One can solve any optimization problem on support networks by listing all elements of , but this approach is generally infeasible due to its enormous size. Although all of , , and grow exponentially with the size of , we present a comparison of their practical sizes to motivate our approach to Problem 2 described in Section 6. The full implementation including the code used to generate the networks, along with the generated samples and detailed count data, is available in our GitHub repository.
For this case study, we generated rooted binary phylogenetic networks with leaves and reticulations. Such a network has edges, and thus of its edges are reticulation edges, representing a high level of reticulation. For each , we created 100 such networks by the following procedure: starting from a rooted binary phylogenetic tree , the procedure repeatedly adds a new edge between two arbitrary edges of , directing it so as to preserve the acyclicity of , and returns . Note that networks generated in this manner may or may not be tree-based, since every new edge is added not between edges of the initial tree but between edges of the current network .
For each generated network , we computed , , and using the linear-time algorithms described in Sections 4.1–4.3. The minimum, maximum, and median counts across the 100 samples of each size are summarized in Table 1.
Table 1 shows that is consistently much smaller than , and is even smaller. This indicates that, even though can be prohibitively large, both and are typically small enough to allow exhaustive enumeration. We also see that the number of support networks varies widely even among networks of the same size, suggesting that the computational cost of exhaustive search within or may vary substantially depending on the structure of . For example, the number of minimum support networks of with was 1728 for a certain sample but was only 4 for another sample.
| Min | Max | Median | Min | Max | Median | Min | Max | Median | ||
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 4 | 15 | 64 | 30 | 2 | 10 | 4 | 1 | 9 | 3 |
| 4 | 6 | 55 | 336 | 160 | 4 | 28 | 9 | 1 | 18 | 4 |
| 5 | 8 | 288 | 2,640 | 900 | 8 | 72 | 24 | 1 | 36 | 8 |
| 6 | 10 | 900 | 15,840 | 4,478 | 6 | 256 | 48 | 1 | 108 | 12 |
| 7 | 12 | 6,435 | 134,784 | 24,720 | 24 | 768 | 142 | 2 | 192 | 24 |
| 8 | 14 | 34,272 | 571,914 | 134,400 | 40 | 1,680 | 270 | 2 | 576 | 38 |
| 9 | 16 | 93,600 | 4,186,080 | 699,920 | 90 | 2,880 | 576 | 2 | 1,152 | 68 |
| 10 | 18 | 449,280 | 21,715,200 | 3,861,936 | 144 | 7,056 | 1,372 | 4 | 1,728 | 124 |
5 Finding a support network with the fewest reticulations
As defined in Section 3, the base tier of is the minimum value of over all support networks of . This leads us to Problem 1.
Problem 1 (Reticulation Minimization).
Given a rooted almost-binary phylogenetic network , compute the base tier , and find a support network of with .
Since and is a spanning subgraph of , the support networks of minimize if and only if it has the fewest edges among all support networks, i.e., it is a minimum support network. Therefore, an optimal solution of Problem 1 can be obtained by simply selecting an arbitrary element from . This is done by computing the maximal zig-zag trail decomposition of , and then selecting an arbitrary element of expressed in (7) for each .
For example, the network in Figure 1 is decomposed into the maximal zig-zag trails through . According to (7), the sequence is -admissible for , , and ; is -admissible for and ; and is -admissible for and . Piecing these -admissible subsets together yields the edge-set of a support network of with the minimum reticulation number, that is, .
Theorem 18.
The above algorithm solves Problem 1 in time.
When is tree-based, holds and a support network with is simply its support tree. In general, equals the number of W-fences in the maximal trail decomposition of , as easily shown by induction on the number of W-fences.
6 Finding a support network with the minimum level
In Section 3, we have also defined , i.e., the base level of . In contrast to Problem 1, we conjecture that Problem 2 is NP-hard. Here, we provide exact and heuristic exponential algorithms for solving Problem 2.
Problem 2 (Level Minimization).
Given a rooted almost-binary phylogenetic network , compute the base level, , and find a support network with .
6.1 Exact algorithm
An obvious exact algorithm for Problem 2 computes of each element in to obtain the minimum value, but this method is clearly impractical. Algorithm 1 performs an exhaustive search in instead of . Recalling the numerical results in Table 1, we know that this search space reduction is significant. We can prove that contains a correct , yielding Theorem 19. The proof is given in Appendix A.4.
6.2 Heuristic algorithm
Algorithm 2 performs an exhaustive search in an even smaller search space . It repeats the same procedure as Algorithm 1, except its iteration number is reduced from to . By Theorem 19, Algorithm 2 therefore runs in time.
Algorithm 2 does not always output a correct solution of Problem 2, because minimizing the number of reticulations does not imply minimizing the level of support networks. Consider a level--based network in Figure 4(a), which has the property that any satisfies and . Then, Algorithm 2 outputs a level- support network as illustrated in Figure 4(b). An optimal support network with exists in while it is not optimal in terms of reticulation numbers, as shown in Figure 4(c).
6.3 Performance evaluation of the exact and heuristic algorithms
We evaluated the performance of our exact algorithm (Algorithm 1) and heuristic algorithm (Algorithm 2) using rooted binary phylogenetic networks with leaves and various reticulation numbers . Although our experiment focuses on only binary networks, this simplification does not undermine the validity of our evaluation, as the search space sizes for Algorithm 1 and for Algorithm 2 are not affected by whether is binary or almost-binary.
For each ranging from 1 to 50, we generated 25 instances using the procedure described in Section 4.4. The experiments were conducted on a MacBook Pro equipped with an Apple M3 Max CPU (4.05 GHz) and 36 GB of memory, with a timeout limit of 30 minutes per instance. The Python code and experimental results, including the outputs of both algorithms for all instances, are available in our GitHub repository.
We first assessed computational efficiency of Algorithms 1 and 2 by measuring runtime on each instance. Figure 5 shows average runtime on a log scale with minimum and maximum values across 25 instances for each . Both algorithms exhibit expected exponential runtime growth as increases. Algorithm 1 completed up to , beyond which timeouts occurred, while Algorithm 2 successfully handled all instances up to . The performance gap is particularly pronounced for larger networks. For example, at , Algorithm 1 approaches its limits while Algorithm 2 completes within seconds on average.
Notably, both algorithms exhibit substantial runtime variability even for the same , as shown by the wide error bars in Figure 5. This variation reflects runtime dependence on network structure beyond just reticulation number, consistent with our observations in Section 4.4 regarding the significant variability of and among instances with identical values.
While Algorithm 1 guarantees optimal solutions (Theorem 19), Algorithm 2 provides no such theoretical guarantee. To evaluate the practical quality of the heuristic solutions, we compared the outputs of Algorithm 2 against the optimal values computed by Algorithm 1 for all instances with where both algorithms completed successfully within the time limit. Figure 6 provides a breakdown of the output differences for each , where is defined as the difference between the value returned by Algorithm 2 and the optimal value computed by Algorithm 1. The bars show the proportion of instances (out of 25) with (exact match), , , and .
Although the accuracy gradually declines as increases, Algorithm 2 maintains practical effectiveness across the entire range of tested networks. In fact, up to , Algorithm 2 achieved exact solutions () for all instances. Even at , it still returned optimal values for approximately 80% of instances. Large deviations () remain rare throughout the tested range, occurring in less than 10% of instances even for the largest networks.
7 Conclusion and future directions
In this paper, we have extended Hayamizu’s structure theorem to develop a theoretical foundation for support networks of rooted almost-binary phylogenetic networks. The extension from support trees allows us to analyze not only tree-based networks but also general, non-tree-based rooted phylogenetic networks. Our main contributions are threefold:
First, we established direct-product characterizations of three families of support networks for a given network : all support networks , minimal support networks , and minimum support networks . These characterizations yielded closed-form counting formulas that can be computed in linear time, revealing interesting connections to well-known integer sequences such as Fibonacci, Lucas, Padovan, and Perrin numbers. We also described and implemented a linear-delay algorithm for listing the support networks in each family.
Second, we developed a linear-time algorithm for Reticulation Minimization, which finds a support network with the minimum reticulation number (minimum tier). We proved that an optimal solution can be found simply by picking any element from the set .
Third, we proposed both exact and heuristic algorithms for Level Minimization. While both algorithms have exponential time complexity, our experimental evaluations demonstrated their practicality across a wide range of reticulation levels. Notably, the heuristic method found optimal solutions in most cases, and when it did not, it still produced good approximations.
There are some promising avenues for future research. In this paper, we have discussed support networks of almost-binary networks, which form a subclass of non-binary networks. However, extending this framework to fully non-binary networks is challenging because the uniqueness of the maximal zig-zag trail decomposition of , a key component of Hayamizu’s structure theorem that underpins our work, is not guaranteed beyond the almost-binary case. It is also important to explore whether support networks with the minimum tier or level can gain meaningful insights into reticulate evolutionary processes from biological data. As our heuristic algorithm empirically produces good approximate solutions, theoretical analysis of its approximation ratio and the design of improved approximation algorithms also remain valuable pursuits.
References
- [1] Gabriel Cardona, Francesc Rosselló, and Gabriel Valiente. Comparison of Tree-Child Phylogenetic Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(4):552–569, 2009. doi:10.1109/TCBB.2007.70270.
- [2] Nathan Davidov, Amanda Hernandez, Justin Jian, Patrick McKenna, K.A. Medlin, Roadra Mojumder, Megan Owen, Andrew Quijano, Amanda Rodriguez, Katherine St. John, Katherine Thai, and Meliza Uraga. Maximum Covering Subtrees for Phylogenetic Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 18(6):2823–2827, November 2021. doi:10.1109/TCBB.2020.3040910.
- [3] Tomislav Došlić and Ivana Zubac. Counting maximal matchings in linear polymers. Ars Mathematica Contemporanea, 11:255–276, January 2016. doi:10.26493/1855-3974.851.167.
- [4] Péter L Erdős, Charles Semple, and Mike Steel. A class of phylogenetic networks reconstructable from ancestral profiles. Mathematical Biosciences, 313:33–40, 2019. doi:10.1016/j.mbs.2019.04.009.
- [5] E. J. Farrell. On the Occurrences of Fibonacci Sequences in the Counting of Matchings in Linear Polygonal Chains. The Fibonacci Quarterly, 24(3):238–246, 1986.
- [6] Mareike Fischer and Andrew Francis. How tree-based is my network? Proximity measures for unrooted phylogenetic networks. Discrete Applied Mathematics, 283:98–114, September 2020. doi:10.1016/j.dam.2019.12.019.
- [7] Andrew Francis, Charles Semple, and Mike Steel. New characterisations of tree-based networks and proximity measures. Advances in Applied Mathematics, 93:93–107, February 2018. doi:10.1016/j.aam.2017.08.003.
- [8] Andrew Francis and Mike Steel. Which Phylogenetic Networks are Merely Trees with Additional Arcs? Systematic Biology, 64(5):768–777, September 2015. doi:10.1093/sysbio/syv037.
- [9] Momoko Hayamizu. A Structure Theorem for Rooted Binary Phylogenetic Networks and Its Implications for Tree-Based Networks. SIAM Journal on Discrete Mathematics, 35(4):2490–2516, January 2021. doi:10.1137/19M1297403.
- [10] Momoko Hayamizu and Kazuhisa Makino. Ranking Top- Trees in Tree-Based Phylogenetic Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 20(3):2349–2355, May 2023. doi:10.1109/TCBB.2022.3229827.
- [11] John Hopcroft and Robert Tarjan. Algorithm 447: Efficient Algorithms for Graph Manipulation. Communications of the ACM, 16(6):372–378, June 1973. doi:10.1145/362248.362272.
- [12] Remie Janssen and Yukihiro Murakami. On cherry-picking and network containment. Theoretical Computer Science, 856:121–150, 2021. doi:10.1016/j.tcs.2020.12.031.
- [13] Thomas Koshy. Fibonacci and Lucas Numbers with Applications. John Wiley & Sons, September 2001. doi:10.1002/9781118033067.
- [14] Charles Semple and Jack Simpson. When is a Phylogenetic Network Simply an Amalgamation of Two Trees? Bulletin of Mathematical Biology, 80(9):2338–2348, 2018. doi:10.1007/s11538-018-0463-x.
- [15] Takatora Suzuki, Han Guo, and Momoko Hayamizu. Bridging Between Deviation Indices for Non-Tree-Based Phylogenetic Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 21(6):2226–2234, 2024. doi:10.1109/TCBB.2024.3456575.
- [16] Louxin Zhang. On Tree-Based Phylogenetic Networks. Journal of Computational Biology, 23(7):553–565, July 2016. doi:10.1089/cmb.2015.0228.
Appendix A Appendix
In what follows, is a rooted almost-binary phylogenetic network on the leaf-set , is the maximal zig-zag trail decomposition of , and is any ordered set of the elements of . In addition, for each , we let and write .
A.1 Proof of Lemma 11
We begin by proving the first statement. To show the one-to-one correspondence between and in the case of , we first prove that is a support network of if and only if is an -admissible subset of . Assume that is a support network of . Then, by definition. We will now verify that satisfies conditions (C1⋆) and (C2⋆). Recall that in Definition 10 is any subgraph of , so we may let be . For any with , we have since otherwise would hold and thus would have a leaf not in the leaf-set of . Also, for any with , we have since otherwise would hold and thus would have a root other than the unique root of . Hence, satisfies (C1⋆). Next, for any distinct with either or , at least one of is in since otherwise, by the assumption that is almost-binary, would be a leaf of or would be an extra root of . Hence, satisfies (C2⋆).
To prove the converse, assume that is an -admissible subset of . Then, satisfies (C1⋆) and (C2⋆). This implies that is a spanning subgraph of . Indeed, if there were with for any , then (C1⋆) or (C2⋆) (or both) would be violated, as easily verified for each possible case of . Let be the spanning subgraph of . Since satisfies both (C1⋆) and (C2⋆), if and only if , so the root of remains the unique root of . Similarly, if and only if , so the leaf-set of remains . Hence, is a support network of .
By the assumption on described in Remark 4, if and are different -admissible subsets of , then and are different support networks of , and the converse also holds. Thus, the map , defined by , is a bijection.
The above arguments apply for similarly to . Indeed, by Definitions 9 and 10, an -admissible subset belongs to if and only if the support network belongs to . Thus, the restriction of to gives a bijection between and . The same holds for . This completes the proof of the first statement.
To prove the second statement for , we show that a subset satisfies conditions (C1⋆) and (C2⋆) with if and only if satisfies (C1⋆) and (C2⋆) with for each . First, observe that if satisfies or , then there exists a unique such that . Since is a partition of , the condition (C1⋆) for with holds if and only if each satisfies the condition (C1⋆) with . Similarly, if two distinct edges share the same head or the same tail, then both edges must belong to a unique , i.e., . Then, the requirement that contains or is equivalent to requiring that contains or . Thus, satisfies (C2⋆) with if and only if each satisfies (C2⋆) with . Hence, is an -admissible subset of (i.e., is a support network of ) if and only if each is an -admissible subset of .
The result just proved for implies that any -admissible subset can be obtained by independently selecting, for each , an -admissible subset , and then taking their union: . Then, similarly to Remark 7, the admissible subset is minimal (resp. minimum) if and only if each is minimal (resp. minimum). This completes the proof.
A.2 Proof of Theorem 12
We prove that holds if is a fence and that holds if is a crown. Consider a fence . As the case of is trivial, we may assume . Let be the undirected graph obtained from by ignoring all edge orientations, and let be its subgraph induced by the non-terminal edges of . We can assume without loss of generality that is a path. Indeed, when is binary, this holds trivially because the fence visits each internal vertex exactly once. If is almost-binary but not binary, may fail to be a path due to the presence of a vertex with both and . However, we can transform by a preprocessing step that splits each such vertex into two vertices connected by a new edge, which ensures that is a path. This operation preserves and does not affect the set .
Let be a subset of and let . By Lemma 11 and Equation (3), is -admissible if and only if and for any , . Then, there exists a bijection between and the family of matchings in the path . For example, when , an -admissible subset of corresponds to a matching in . Theorem 1 in [5] says that the number of (possibly empty) matchings in a path with edges equals . Since , the number of matchings in equals . Hence, holds.
Consider a crown . Since a crown has no terminal edges, is -admissible if and only if for any , . Let be the undirected graph obtained from by ignoring all edge orientations. Using the pre-processing described above, we can treat as a cycle graph without loss of generality. Similarly to above, there is a bijection between and the family of matchings in the cycle . Theorem 2 in [5] says that the number of (possibly empty) matchings in a cycle with edges equals . Hence, holds.
Theorems 5.6 and 5.8 in [13] say that and are expressed by the closed-forms and , respectively. Then, and hold. This implies holds. This completes the proof.
A.3 Proof of Theorem 14
We prove that holds if is a fence and holds if is a crown. Consider a fence . As the case of is trivial, we may assume . Let be the undirected path as in the proof of Theorem 12. An -admissible subset of is minimal (namely, -admissible) if and only if a matching in corresponding to is maximal. Then, there exists a bijection between and the family of maximal matchings in . For example, when , a -admissible subset of corresponds to a maximal matching in . Proposition 3.1 in [3] say that the number of maximal matchings in a path with edges equals . Since , the number of maximal matchings in equals . Hence, holds.
Consider a crown . Let be the undirected cycle as in the proof of Theorem 12. Then, there exists a bijection between and the family of maximal matchings in the cycle . By the statement after Proposition 3.9 in [3], the number of maximal matchings in a cycle with edges equals . Thus, .
By [3], and hold. Hence, holds. This completes the proof.
A.4 Proof of Theorem 19
As Algorithm 1 checks the level of each support network in , our goal is to show that contains a level- support network of . To obtain a contradiction, assume that holds for any . This implies that any level- support network of is in . Since is not minimal, there exists a minimal one with . Then, since each block of is a subgraph of its corresponding block of , which is a contradiction. This completes the proof of the correctness of Algorithm 1.
Since is a binary network, we have . Also, as any minimal support network is a subgraph of , we have . For each , it takes time to compute its block decomposition using depth-first search [11]. For any block , one can compute in time. Then, for each , it takes time to calculate , namely the maximum across all blocks of . Overall, it takes time to obtain the minimum value of across all . By Theorem 14, holds.
