A New Paradigm for Identifying Reconciliation-Scenario Altering Mutations Conferring Environmental Adaptation

An important goal in microbial computational genomics is to identify crucial events in the evolution of a gene that severely alter the duplication, loss and mobilization patterns of the gene within the genomes in which it disseminates. In this paper, we formalize this microbiological goal as a new pattern-matching problem in the domain of Gene tree and Species tree reconciliation, denoted"Reconciliation-Scenario Altering Mutation (RSAM) Discovery". We propose an $O(m\cdot n\cdot k)$ time algorithm to solve this new problem, where $m$ and $n$ are the number of vertices of the input Gene tree and Species tree, respectively, and $k$ is a user-specified parameter that bounds from above the number of optimal solutions of interest. The algorithm first constructs a hypergraph representing the $k$ highest scoring reconciliation scenarios between the given Gene tree and Species tree, and then interrogates this hypergraph for subtrees matching a pre-specified RSAM Pattern. Our algorithm is optimal in the sense that the number of hypernodes in the hypergraph can be lower bounded by $\Omega(m\cdot n\cdot k)$. We implement the new algorithm as a tool, called RSAM-finder, and demonstrate its application to -the identification of RSAMs in toxins and drug resistance elements across a dataset spanning hundreds of species.


Introduction
Prokaryotes can be found in the most diverse and severe ecological niches of the planet. Adaptation of prokaryotes to new niches requires expanding their repertoire of protein families, via two evolutionary processes: first, by selection of novel gene mutants carrying stable genetic alterations that confer adaptation, and second, by dissemination of an adaptively mutated gene. These two processes are correlated: an adaptation-conferring mutation in a gene could accelerate its mobilization across bacterial lineages populating the corresponding environmental niche (Poirel et al. (2009)), and vice-versa, the mobilization of a gene by transposable elements increases its chances to mutate or "pick up" novel genomic context. Thus, an important research goal is to identify gene-level mutations that affect the spreading pattern of the mutated gene within and across the genomes harboring it.
For example, consider mutations conferring adaptation of bacteria to a human-pathogenesis environment. Here, a mutation to a resistance or virulence factor could enhance pathogenic adaptation, thus increasing the horizontal mobilization of the mutated gene within other human pathogens inhibiting this niche (Poirel et al. (2009)). In this case, we say that the mutation has a causal association with the observed dissemination pattern of the mutated gene (i.e. the increased mobilization of the gene among pathogenic bacteria). Identifying such mutations could inform infectious disease monitoring and outbreak control, and assist in identifying potential drug targets. The co-evolution of genes and their host species is classically described by computing the most parsimonious reconciliation scenario between a given Gene tree G and the corresponding Species tree S, that is, a mapping of each vertex u ∈ G to a vertex x ∈ S. Three major evolutionary processes, traditionally considered by reconciliation approaches, are horizontal gene transfer, gene duplication, and gene loss (Tofigh et al. (2011)). Each mapping of a vertex u ∈ G to a vertex x ∈ S is associated with one of these evolutionary events, and assigned a cost, accordingly (see Fig. 1). The optimization problem of computing a least-cost reconciliation between G and S, where the total cost is computed as the sum of the costs assigned to each of the map- Motivated by examples such as the one given above, we formalize a new pattern-matching problem in the domain of DLT reconciliation. Given are a Gene tree G, a corresponding Species tree S, a mapping σ from the leaves of G to the leaves of S, and (optional) an environmental annotation labeling the leaves of the input trees. Let H denote some data structure, to be defined later in the paper, that models the space of reconciliations between G and S. A DLT Reconciliation Scenario Pattern denotes a mapping between a vertex u ∈ G to a vertex x ∈ S, which obeys a set of user-defined specifications regarding the corresponding reconciliation event, the labels on the paired vertices, and other features associated with the mapping. Mappings between pairs of vertices (u ∈ G, x ∈ S) that abide by the requirements specified by P are denoted instances of P in H. Given a pre-specified DLT Reconcilation Scenario pattern P and a data structure H modeling the space of reconciliations between G and S, a Reconciliation Scenario Altering Mutation (RSAM) of P in H is a vertex v ∈ G representing a gene mutation with a putative causal association to instances of P in H. The RSAM Discovery problem is to identify RSAMs in G.
In what follows, we propose a three-stage solution to the RSAM Discovery problem defined above (illustrated in Fig. 2). The first stage constructs a hypergraph H that recursively aggregates all the k-best reconciliations of G and S. Each supernode in H consists of k hypernodes, where each hypernode represents a partial solution for the DLT-reconciliation problem. Our hypergraph-ensemble approach is based on a model proposed by Patro and Kingsford (2013) for network evolution, where here we extend and adapt it to the DLT model. This hypergraph of k-best reconciliations, intended to provide some robustness to the noise typical of this data, will serve as the search-space for the pattern-matching stage. The second stage of our proposed solution consists of assigning a probability to each partial solution, that is, to each hypernode of H. Finally, in the third stage, instances of the sought RSAM-pattern P are identified within H, and RSAM-ranking scores are assigned accordingly to the vertices of G. Based on these scores, vertices representing putative RSAMs are identified in G and subjected to biological interpretation.
The construction of H, in the first stage, is the computational bottleneck of the RSAM-Discovery pipeline mentioned above. Here, we adapt the approach proposed by Bansal et al. (2012) for the basic, 1-best variant of DLT reconciliation, extending it to an efficient k-best variant. This yields an O(m · n · k) time algorithm for the problem, where m and n are the number of vertices of the input Gene tree and Species tree, respectively, and k is a user-specified parameter that bounds from above the number of optimal solutions of interest. Our algorithm is optimal in the sense that the number of hypernodes in the hypergraph can be lower bounded by Ω(m · n · k).
We remark that a simpler, O(mn(n + k) log(n + k)) algorithm for hypergraph construction can be obtained by directly building upon the dynamic programming (DP) algorithm of Tofigh et al. (2011) and employing the method of cube pruning by Huang and Chiang (2005) to handle lists of (partial) k-best solutions efficiently. Pseudocode for this "naive" algorithm can be found in  Section 2. Just like Bansal et al. (2012) shaved off one n factor in the time complexity of the algorithm by Tofigh et al. (2011), so do we shave the n factor in the term (n+k) in the time complexity of the aforementioned naive algorithm. Surprisingly, we show that by relying on the improved DP algorithm, a further speed up is achieved-namely, the usage of a queue becomes unnecessary and therefore the log(n + k) factor is eliminated.
Our proposed solution to the problem defined in this paper is implemented as a tool called RSAM-finder, publicly available in Zoller (2019). We assert the performance of RSAM-finder in large scale simulations, and exemplify its application to the identification of RSAMs across a datasets spanning hundreds of species.
Previous Related Works. The DLT Reconciliation problem has been extensively studied. In particular, two main DLT variants have been considered: (1) the undated DLT-reconciliation, where the species are undated, and (2) the fully-dated DLT-reconciliation, where either each vertex in the Species (and Gene) tree is associated with an estimated date or the vertices of the Species (and Gene) tree are associated with a total order, and any reconciliation must respect these dates or order (i.e. an HT event can occur only between co-existing species).
In the acyclic version of these variants, there cannot exist two genes such that one is a descendant of the other, yet the descendant is mapped (in the Species tree S) to an ancestor of the other. Tofigh et al. (2011) showed that the acyclic undated version is NP-Hard. However, the acyclic dated version becomes polynomially solvable (Libeskind-Hadas and Charleston (2009)). Tofigh et al. (2011);Tofigh (2009) and David and Alm (2011) studied a version of the undated (cyclic) problem that ignores losses and proposed an O(mn 2 ) dynamic programming algorithm for it. They also gave a fixedparameter tractable algorithm for enumerating all optimal solutions. The time complexity of the algorithm was improved to O(mn) in Tofigh (2009) (under a restricted model that ignores losses) and in Bansal et al. (2012) (which does not ignore losses).
It is well known that the biological data used as input to the DLT Reconciliation problem could be inaccurate, whether due to a sequencing problem, a problem in the reconstruction of G or S (Bapteste et al. (2009)), or due to some other problem caused by noise. To overcome this problem, previous works try to examine more than one optimal solution, for example, see Donati et al. (2015) and Scornavacca et al. (2013). A probabilistic method for exploring the space of optimal solutions was suggested in Bansal et al. (2013) and Doyon et al. (2009), where the latter was improved in Doyon et al. (2011). Additional studies considered a space of candidate co-optimal sce-narios within special variants of the DLT problem, some of which employed special constraints to drive the search (Stolzer et al. (2012); ; Merkle et al. (2010); Charleston (1998)). Although all of the previous works reviewed in this paragraph compute a space of candidate reconciliation scenarios, none of these works considered the application of pattern matching on this space, as we do in this work.
DLT Reconcilation variants where the reconcilation computation is guided by constraints derived from vertex-coloring information, were proposed in applications studying host-parasite co-evolution, such as Berry et al. (2018), where the vertex coloring (in both G and S) represents the geographical area of residence. However, the applied constraints were "hard-wired" to the specific problem addressed in that paper. In contrast, the approach proposed in this paper is more general, supporting a pattern-search that is guided by a user-defined pattern. Our tool RSAM-finder provides the users with a query language able to express more robust patterns, according to the various applications where the pattern-search is to be employed.

Preliminaries
For a (binary) rooted tree T , let L(T ), V (T ), I(T ) and E(T ) denote the sets of leaves, vertices, internal vertices and edges, respectively, of T . Additionally, let V (T ) denote the set of finite (ordered) vectors over When T is clear from context, let V = V (T ) . Throughout, we treat any (binary) rooted tree T as a directed graph whose edges are directed from root to leaves. Then, if (u, v) ∈ E(T ), we say that v is a child of u, and u is the parent of v. For u, v ∈ V (T ), the notation v ≤ T u signifies that v is a descendant of u (alternatively, u is an ancestor of v), i.e. there is a directed path from u to v or u = v. We say that v is a proper descendent (resp. proper ancestor) of u if v ≤ T u (resp. v ≥ T u) and u = v, denoted < T (resp. > T ). When both u ≤ T v and v ≤ T u, we say that u and v are incomparable. For any u, v ∈ V (T ), let d T (u, v) denote the number of edges in the (unique simple undirected) path between u and v in T . When T is clear from context, we drop it from the notations v ≤ T u and d T (u, v). For any u ∈ V (T ), let T u denote the subtree of T rooted in u (then, DLT Scenario. A DLT scenario for two binary trees G (the Gene tree) and S (the Species tree) is a tuple σ, γ, Σ, ∆, Θ, Ξ where σ : L(G) → L(S) is a mapping of the leaves of G to the leaves of S, γ : V (G) → V (S) is a mapping of the vertices of G to the vertices of S, and (Σ, ∆, Θ) is a partition of I(G) (the set of internal vertices of G) into three event classes: Speciation (Σ), Duplication (∆) and Horizontal Transfer (Θ). The subset Ξ ⊆ E(G) specifies which edges are involved in horizontal transfer events. Additionally, the following constraints should be satisfied.
1. Consistency of σ and γ. For each leaf u ∈ L(G), γ(u) = σ(u). This constraint ensures that γ respects σ-that is, each leaf of G is mapped to the species where it is found.
2. Consistency of γ and ancestorship relations in S. For each u ∈ I(G) with children v and w: (a) γ(u) < S γ(v) and γ(u) < S γ(w). This constraint ensures that each of the two children (in G) of the gene u is mapped by γ to a species that is not a proper ancestor (in S) of the species to which the gene u is mapped; thus, it can be either a descendant of u or incomparable to u.
(b) At least one of γ(v) and γ(w) is a descendant of γ(u). This constraint ensures that at least one of the two children (in G) of the gene u is mapped by γ to a species that is a descendant (in S) of the species to which the gene u is mapped.
3. Identifying horizontal transfer edges. For each edge (u, v) ∈ E(G), it holds that (u, v) ∈ Ξ if and only if γ(u) ≤ S γ(v) and γ(v) ≤ S γ(u). This constraint identifies which edges are horizontal transfer edges-specifically, a horizontal transfer edge is an edge (u, v) ∈ E(G) from a gene u to a gene v that are mapped to species γ(u) and γ(v) that are incomparable.

4.
Associating events with internal vertices. For each u ∈ I(G) with children v, w: Losses. Our definition of a loss event is based on the definition given by Bansal et al. (2012). Consider a Gene tree G, a Species tree S and a corresponding DLT scenario α = σ, γ, Σ, ∆, Θ, Ξ . Let u ∈ V (G) with children v and w (if they exist). Define Loss α (u) as the number of losses at u. Intuitively, the number of losses at a vertex u is the number of "skips" the gene made in the tree S at the evolutionary event that u represents. Formally, is the distance between u and v in the tree Species S. The formula above determines that the number of losses in a vertex u ∈ V (G) is based on the event that occurred in u. First, if u ∈ Σ (i.e. u represents a speciation event), then the number of losses is the sum of the distances between u and its two children in the Species tree (by the mapping γ) without counting the first step. If u ∈ ∆ (i.e. u represents a duplication event), then the number of losses is the sum of the distances between u and its two children in the Species tree. If (u, w) ∈ Ξ (i.e. u represents a horizontal transfer event that happened in the edge (u, w)), then the number of losses is the sum of the distances between u and its other child (i.e. v) in the Species tree.
Costs. Let c Σ , c ∆ , c Θ and c loss denote the costs of a speciation event, a duplication event, a horizontal transfer event and a loss event, respectively. Let Loss α = u∈V (G) Loss α (u). Let |Σ| · c Σ + |∆| · c ∆ + |Θ| · c Θ + Loss α · c loss be the reconciliation cost of α. When seeking a "best" DLT scenario, the goal is to find one that minimizes this cost.

Hypergraph of k-Best Scenarios
To represent k-best solutions, 2 we use a directed hypergraph denoted by H based on the notation in Huang and Chiang (2005). The hypergraph is a tuple H = V, E , where V is a finite set of hypernodes, and E is a finite set of (directed) hyperedges defined as follows. Each e ∈ E is a pair T (e), h(e) , where h(e) ∈ V is the head of e, and T (e) ∈ V * (i.e. T (e) is a vector of vertices in V ) is its tail. In our settings, |T (e)| = 2 for every e ∈ E. In what follows, we define the hypernodes and hyperedges of H with respect to our problem. To exemplify this, we refer the reader to Fig. 3. In part B of this figure, the hypernode (u 5 , x 5 , 1) is annotated with score 1 and event 'S'.
To extract the best solution from the hypergraph, we begin with the first (i.e. top-ranking) slot in the root of the hypergraph (which is (u 5 , x 5 , 1) in the figure), and then follow the incoming hyperedges in top-down order. In the figure, the best solution is solution (1) in part C of the figure. To extract it from the hypergraph in part B of the figure, we map u 5 to x 5 with a cost of 1 and a speciation event. Then, by first following the hyperedges incoming to (u 5 , x 5 , 1), we derive the mapping of u 3 to x 1 , and of u 4 to x 3 . Finally, by following the hyperedges incoming to (u 3 , x 1 , 1), we also derive the mapping of u 1 to x 1 , and of u 2 to x 2 . The second best solution (solution (2) in part C of the figure) is extracted in the same manner-now, we start with the hypernode (u 5 , x 5 , 2) rather than (u 5 , x 5 , 1), and again follow incoming hyperedges in a top-down order until we reach the leaves. Similarly, we can extract all three non-nil solutions among the 4-best solutions (illustrated in part C). As before, the outer tubes illustrate the edges of S, and the edges of G are embedded inside based on the reconciliation.
• Hypernodes. For every vertex u in G, a vertex x in S and an integer i ∈ {1, . . . , k}, we have a hypernode (u, x, i) in H. Such a hypernode (u, x, i) is associated with the i th best (where ties are broken arbitrarily) solution mapping the subtree of G rooted in u to the subtree of S rooted in x that is a DLT scenario. In addition, for every integer i ∈ {1, . . . , k} we have a hypernode (root, i) in the hypergraph H. Such a hypernode (root, i) is associated with the i th best solution of mapping G (entirely) to any subtree of S. Each hypernode (u, x, i) has a score c(u, x, i), and each hypernode (root, i) has a score c(root, i). Moreover, each hypernode (u, x, i) is associated with the event corresponding to the mapping of u and x in the DLT scenario of (u, x, i) (speciation, duplication or horizontal transfer), denoted event(u, x, i).
is the set of k hypernodes corresponding to the mapping of the subtree of G rooted in u to the subtree of G rooted in x). This notation will simplify our presentation.
• Hyperedges. We remind the reader that each hypernode (u, x, i) ∈ V (H) describes a DLT scenario. Each hypernode has exactly one incoming hyperedge, but it can have multiple outgoing hyperedges. In particular, for each hypernode (u, x, i) ∈ V (H), the (only) incoming hyperedge e = T (e), h(e) = [(v, y, j), (w, z, r)], (u, x, i) describes the mapping of the subtrees of the children of u, namely, v and w, in the scenario of (u, x, i); here, the subtree of v is mapped to the subtree of y as in the scenario of (v, y, j), and the subtree of w is mapped to the subtree of z as in the scenario of (w, z, r).

Framework and Algorithms
In this section, we elaborate on each of the three stages of the workflow in Section 1.

Stage 1: Hypergraph Construction
The first stage of our framework is to construct the hypergraph described in Section 3. To this end, we develop an efficient algorithm that runs in time O(m · n · k) and requires O(m · n · k) space.
An Overview of the Algorithm. We iterate over all u ∈ V (G) in postorder, as well as over all x ∈ V (S) in postorder. (However, as explained immediately, when we consider a vertex u ∈ V (G), after iterating over all vertices x ∈ V (S) in postorder, we also iterate over all vertices x ∈ V (S) in preorder.) In each iteration, corresponding to a pair (u, x), we construct three lists: p Σ (speciation), p ∆ (duplication) and p Θ (horizontal transfer). Specifically, p Σ should be a list of k-best solutions that are DLT scenarios where the subtree of G rooted in u is mapped to the subtree of S rooted in x under the restriction that the event corresponding to matching u and x is speciation. The meaning of the lists p ∆ and p Θ is similar, where the restriction of speciation is replaced by duplication or horizontal transfer, respectively. Having these three lists suffices to construct the hypernode (u, x).
To avoid repetitive computation, we maintain three additional lists: subtree, subtreeLoss and incomp. Intuitively, subtreeLoss(u, x, i) represents the i th best cost of reconciliation of the tree rooted in u, such that u may be mapped to any y ≤ x with a additional cost of one loss per edge in the path from x to y, and incomp(u, x, i) represents the i th best cost of a reconciliation of the subtree of G rooted in u with some subtree of S whose root is a vertex y incomparable to x. subtree is used in order to efficiently compute incomp. The notations subtreeLoss(u, x) , subtree(u, x) and incomp(u, x) refer to the lists of the k-best scores {subtreeLoss(u, , respectively, similarly to our usage of the notation of a supernode. The efficient computation of p Σ , p ∆ and p Θ , along with the maintenance of subtreeLoss, subtree and incomp themselves, is highly non-trivial. On a highlevel, we first initialize all five lists to contain only costs of ∞; then, still in the initialization phase, we add hypernodes that match between leaves of G and S in accordance with σ and update subtreeLoss and subtree consequently. After the initialization, the main computation considers each u ∈ V (G) in postorder, and performs two steps. In the first step, we consider each x ∈ V (S) in postorder. Then, for each i ∈ {1, . . . , k}, we compute p Σ (u, x, i), p ∆ (u, x, i) and p Θ (u, x, i) based on somewhat involved recursive formulas. Afterwards, we construct the surpernode (u, x), as well as compute the lists subtreeLoss(u, x) and subtree(u, x). In the second step, we consider each x ∈ I(S) with children y and z in preorder, and compute the lists incomp(u, y) and incomp(u, z).
Having constructed all hypernodes of the form (u, x, i) along with their ingoing hyperedges, it is trivial to construct the hypernodes of the form (root, i) and their ingoing edges.
Psudocode. The psudocode is given in Fig. 4 and 5. We use the notation imin, defined as follows: Let X and Y be sets, and consider a function . In case f is not an injective function, hence there are multiple choices for x, we break ties arbitrarily. We proceed with a few clarifications of the pseudocode.
Initialization: Lines 1-13. We initialize all lists to contain only scores of ∞ (lines 1-3). Then, the lists associated with a matching between leaves that comply with σ-that is, supernodes of the form (u, σ(u)) for some u ∈ V (G)-are inserted into the hypergraphs, and their topmost items are updated with a leaf event, cost 0, and subtreeLoss and subtree 0 (because the cost of the best solutions mapping a gene to its species is 0). Division into First and Second Phases: Lines 14-39. For each vertex u ∈ I(G) in postorder (line 14), we have two phases, on which we elaborate below. In the first phase (lines 15-34), we consider each vertex x ∈ V (S) in postorder and perform most computations, and in the second phase (lines 35-38) we consider each vertex x ∈ V (S) in postorder and compute the lists of incomp.
Recursive Formulas for p Σ , p ∆ and p Θ : Lines 16-30. In this part of the first phase, we find the k-best costs for mapping the subtree of G rooted in u to the subtree of S rooted in x for each possible event (speciation, duplication or horizontal transfer), based on computations done in previous iterations or the initialization. The recursive formulas for these computations  First, in line 31, we immediately find k-best costs for mapping the subtree of u to the subtree of x (i.e. we compute c(u, x)) by selecting k-best costs from the list that is the combination of p Σ (u, x), p ∆ (u, x) and p Θ (u, x). Notice that in this line, we also add the appropriate hypernodes and hyperedges to the hypergraph. event(u, x, i) is defined by the source list (p Σ (u, x), p ∆ (u, x) or p Θ (u, x)) it came from. As before, if the combined list is shorter than k, we add hypernodes with event = Nan and cost = ∞. Secondly, in lines 32-33, we find k-best costs for mapping the subtree of u to the subtree of some vertex x in the subtree of x, with and without loss events (i.e. we compute subtreeLoss(u, x) and subtree(u, x)) by selecting k-best costs from the combination of pre-calculated lists.
Updating incomp in Second Phase: Lines 35-38. To compute the lists of the form incomp(u, ·), in the second phase we iterate over all vertices x ∈ I(S) with children y and z in preorder. We note that now the traversal of S is in preorder rather than postorder because the computation of a list incomp(u, a) for a vertex a ∈ V (S) that is not the root of S relies on having already computed the list incomp(u, b) where b is the parent of a in S. Specifically, for a vertex x ∈ I(S) with children y and z, we compute the list incomp(u, y) by selecting k-best costs from the list that is the combination of incomp(u, x) and subtree(u, z), and symmetrically for incomp(u, z) (swapping the roles of y and z).
Lemma 1. Given an instance (G, S, σ) of the DLT problem and a positive integer k, the algorithm correctly constructs a hypergraph H that represents k-best solutions for (G, S, σ).
Proof of Lemma 1. We prove that for every pair of vertices u ∈ V (G) and x ∈ V (S), and every index i ∈ {1, . . . , k}, if there exists an i th best DLT scenario mapping the subtree of G rooted in u to the subtree of S rooted in x, then the hypernode (u, x, i) is inserted into the hypergraph H under construction with association to this scenario. In this lemma, the proof of this claim is done in conjunction with the proof that for every pair of vertices u ∈ V (G) and x ∈ V (S), and every index i ∈ {1, . . . , k}, the following equalities hold.
• subtreeLoss(u, x, i) is the i th best cost of a DLT scenario mapping the subtree of G rooted in u to some subtree of S whose root is a vertex y that is a descendant of x, with additional cost of one loss per each edge in the path from x to y.
• subtree(u, x, i) is the i th best cost of a DLT scenario mapping the subtree of G rooted in u to some subtree of S whose root is a vertex y that is a descendant of x.
• incomp(u, x, i) is the i th best cost of a DLT scenario mapping the subtree of G rooted in u to some subtree of S whose root is a vertex y that is incomparable to x.
The proof is by induction on the order of computation. In particular, table 1 (u 1 , x 1 ) < table 2 (u 2 , x 2 ) where table 1 , table 2 ∈ {c, subtreeLoss, subtree, incomp} if one of the following conditions holds: • u 1 is visited before u 2 in the postorder traversal of G.
• u 1 = u 2 , table 1 = table 2 = incomp, and x 1 is visited before x 2 in the preorder traversal of S.
The basis of the induction comprises of the computation of hypernodes of the form (u, x, i) where u ∈ L(G). To prove its correctness, consider such a hypernode (u, x, i). If i = 1 and x = σ(u), then the algorithm inserts the hypernode (u, x, 1), assigning it a score of 0, and setting the remaining fields as follows: subtreeLoss = 0 , subtree = 0 and event = leaf; else, if i = 1 and x ≥ S σ(u), then the algorithm inserts the hypernode (u, x, 1), assigning it a score of 0 and setting the remaining fields as follows: subtreeLoss = c Loss · d S (x, σ(u)), subtree = 0 and event = leaf; otherwise, the algorithm does not insert the hypernode-more precisely, it inserts a placeholder (whose event is NaN) with score ∞ and subtree value ∞. In both cases, incomp value remains ∞ as in its creation. The correctness of these operations directly follows from the definitions of subtreeLoss, subtree and incomp, and the fact that the only possible DLT scenario in this case maps u to an ancestor of σ(u), and the score of this match is 0 in case losses are not counted, or with the additional loss costs otherwise.
For the inductive step, we consider some pair of vertices u ∈ I(G) and x ∈ V (S) along with a table table ∈ {c, subtreeLoss, subtree, incomp}, and prove that the values in table of the supernode (u, x) are computed correctly. For the inductive assumption, suppose that for every triple (table , u , x ) ordered before (table, u, x), the values in table of (u , x ) have already been computed correctly. Here we provide a proof for table = c and table = incomp. The full proof can be found in Section 3 of .
First, consider the case where table = incomp. By the pseudocode, if x is the root of S, then incomp(u, x) does not contain any item (having score different from ∞) as in its creation, which is correct because in this case, there exists no vertex incomparable to x and hence we cannot map one of the children of u as required in the definition of the DLT scenarios that correspond to incomp(u, x). Therefore, now suppose that x is not the root of S, and let p denote the parent of x in S, and s denote the sibling of x in S (i.e. the other child of p in S). Then, by the pseudocode, incomp(u, x) consists of the k-best scores from the lists incomp(u, p) and subtree(u, s). Observe that these two lists have already been computed. Thus, by the inductive hypothesis, incomp(u, p) consists of the scores of the k-best DLT scenarios mapping the subtree of G rooted in u to the subtree of S rooted in some vertex incomparable to p, and subtree(u, s) consists of the scores of the k-best DLT scenarios mapping the subtree of G rooted in u to the subtree of S rooted in some descendant of s. Notice that a DLT scenario maps the subtree of G rooted in u to a subtree of S rooted in some vertex incomparable to x if and only if it is a DLT scenario that maps the subtree of G rooted in u to one of the following subtrees: (i) a subtree of S rooted in some vertex incomparable to p; (ii) a subtree of S rooted in some descendant of s. Thus, it follows that incomp(u, x) is computed correctly.
Second, consider the case where table = c. By line 31 of the pseudocode, c(u, x) consists of the k-best scores from the lists p Σ (u, x), p ∆ (u, x) and p Θ (u, x). Thus, to prove the correctness of the computation of c(u, x), it suffices to prove that the following statement holds: p Σ (u, x), p ∆ (u, x) and p Θ (u, x) consist of k-best DLT scenarios mapping the subtree of G rooted in u to the subtree of S rooted in x under the constraint that the event corresponding to the matching of u and x is speciation, duplication and horizontal transfer, respectively.
Towards the proof of the statement, consider the list p Σ (u, x). If x ∈ L(S), then because u ∈ I(G), there does not exist a DLT scenario mapping the subtree of G rooted in u to the subtree of S rooted in x under the constraint that the event corresponding to the matching of u and x is speciation, and hence the assignment of ∞ to every element p Σ (u, x, i) of the list is correct. Now, suppose that x ∈ I(S). Then, by the pseudocode, p Σ (u, x) consists of the k-best scores present in the following multisets: • {subtreeLoss(v, y, j) + subtreeLoss(w, z, r) | 1 ≤ j, r ≤ k} and • {subtreeLoss(w, y, j) + subtreeLoss(v, z, r) | 1 ≤ j, r ≤ k}.
Observe that the lists subtreeLoss(v, y), subtreeLoss(w, z), subtreeLoss(w, y) and subtreeLoss(v, z) have already been computed. By the definition of Loss α (u) when the event occurred in u is speciation, otherwise. Thus, by the inductive hypothesis, subtreeLoss(v, y) (resp., subtreeLoss(w, z), subtreeLoss(w, y) and subtreeLoss(v, z)) consists of the scores of the k-best DLT scenarios mapping the subtree of G rooted in v (resp., w, w and v) to a subtree of S rooted in some descendant of y (resp., z, y and z), with additional loss cost for each edge in the path from y to γ(v) (resp., γ(w), γ(w) and γ(v)). Notice that a DLT scenario maps the subtree of G rooted in u to a subtree of S rooted in x under the constraint that the event corresponding to the matching of u and x is speciation if and only if it is a DLT scenario that matches u and x, maps the subtree of G rooted in v to a subtree of S rooted in a descendant of one child (y or z) of x, and the subtree of G rooted in w to a subtree of S rooted in a descendant of the other child of x. Thus, it follows that p Σ (u, x) is computed correctly. Now, consider the list p ∆ (u, x). In case x ∈ I(S), let y and z denote its children. By the pseudocode, p ∆ (u, x) consists of the k-best scores obtained by adding c ∆ to the costs present in the following multisets, where only the first one is relevant in case x ∈ L(S): • {c(w, x, j) + subtreeLoss(v, z, r) | 1 ≤ j, r ≤ k} + c Loss , • {subtreeLoss(v, y, j) + subtreeLoss(w, z, r) | 1 ≤ j, r ≤ k} + 2c Loss , • {subtreeLoss(w, y, j) + subtreeLoss(v, z, r) | 1 ≤ j, r ≤ k} + 2c Loss , • {subtreeLoss(v, y, j) + subtreeLoss(w, y, r) | 1 ≤ j, r ≤ k} + 2c Loss and • {subtreeLoss(v, z, j) + subtreeLoss(w, z, r) | 1 ≤ j, r ≤ k} + 2c Loss .
Observe that the lists c(v, x), c(w, x), subtreeLoss(w, y), subtreeLoss(v, z), subtreeLoss(w, z) and subtreeLoss(v, y) have already been computed. By the definition of Loss α (u) when the event occurred in u is duplication, Loss α (u) = d S (x, γ(v)) + d S (x, γ(w)). If v (resp. w) is mapped to x and w (resp. v) is mapped to a subtree of S rooted in y or z, it holds that Loss α (u) = d S (y, γ(w)) + 1 (resp. Loss α (u) = d S (y, γ(v)) + 1, Loss α (u) = d S (z, γ(w)) + 1 and Loss α (u) = d S (z, γ(v)) + 1). If both v and w are mapped to x, Loss α (u) = 0, and if v (resp. w) is mapped to y or z and w (resp. v) is mapped to y or z, it holds that Loss α (u) = d S (y, γ(v)) + d S (z, γ(w)) + 2 (resp. Loss α (u) = d S (y, γ(v)) + d S (z, γ(w)) + 2, Loss α (u) = d S (y, γ(v)) + d S (v, γ(w))+2, Loss α (u) = d S (z, γ(v))+d S (z, γ(w))+2, Loss α (u) = d S (z, γ(v))+ d S (z, γ(w)) + 2). Thus, by the inductive hypothesis, we have that (i) c(v, x) (resp. c(w, x)) consists of the scores of the k-best DLT scenarios mapping the subtree of G rooted in v (resp. w) to the subtree of S rooted in x, and (ii) subtreeLoss(v, y) (resp. subtreeLoss(w, z), subtree(w, y) and subtreeLoss(v, z)) consists of the scores of the k-best DLT scenarios mapping the subtree of G rooted in v (resp. w, w and v) to the subtree of S rooted in some descendant of y (resp. z, y and z) with additional loss cost for each edge in the path from y to γ(v) (resp. γ(w), γ(w) and γ(v)). Notice that a DLT scenario maps the subtree of G rooted in u to a subtree of S rooted in x under the constraint that the event corresponding to the matching of u and x is duplication if and only if it is a DLT scenario that matches u and x, maps the subtree of G rooted in v to a subtree of S rooted in a descendant of x (which can be x itself), and the subtree of G rooted in w to a subtree of S rooted in a descendant of x (which can be x itself). Thus, it follows that p ∆ (u, x) is computed correctly.
Lastly, consider the list p Θ (u, x). If x is the root of S, then there does not exist a DLT scenario mapping the subtree of G rooted in u to the subtree of S rooted in x under the constraint that the event corresponding to the matching of u and x is horizontal transfer (because there is no vertex incomparable to x to whom one of the children of u should be mapped), and hence it is correct that each element p Θ (u, x, i) remains with the assignment of ∞ as it was created. Now, suppose that x is not the root of S. Then, by the pseudocode, p Θ (u, x) consists of the k-best scores present in the following multisets: Observe that the lists subtreeLoss(v, x), subtreeLoss(w, x), incomp(w, x) and incomp(v, x) have already been computed. By the definition of Loss α (u) when (u, w) ∈ Ξ , Loss α (u) = d S (x, γ(v)). Thus, by the inductive hypothesis, we have that (i) subtreeLoss(v, x) (resp. subtreeLoss(w, x)) consists of the scores of the k-best DLT scenarios mapping the subtree of G rooted in v (resp. w) to a subtree of S rooted in some descendant of x (which can be x itself), with additional loss cost for each edge in the path from x to γ(v) (resp. γ(w)).
(ii) incomp(v, x) (resp. incomp(w, x)) consists of the scores of the k-best DLT scenarios mapping the subtree of G rooted in v (resp. w) to a subtree of S rooted in some vertex incomparable to x. Notice that a DLT scenario maps the subtree of G rooted in u to a subtree of S rooted in x under the constraint that the event corresponding to the matching of u and x is horizontal transfer if and only if it is a DLT scenario that matches u and x, maps the subtree of G rooted in one of the children of u (v or w) to a subtree of S rooted in a descendant of of x (which can be x itself), and the subtree of G rooted in the other child of u to a subtree of S rooted in a vertex incomparable to x. Thus, it follows that p Σ (u, x) is computed correctly.
Observation 1. Given an instance (G, S, σ) of the DLT problem and a positive integer k, the algorithm runs in time O(m · n · k) an requires O(m · n · k) space.
Proof of observation 1. For each pair of vertices u ∈ V (G) and x ∈ V (S), we construct a tuple of lists (p Σ (u, x), p ∆ (u, x), p Θ (u, x), subtreeLoss(u, x), subtree(u, x), incomp(u, x)). From the pseudocode, it is clear that the computation of each one of these lists is done in time O(k). Thus, we have that the total running time is O(m · n · k). As space is bounded by time, the observation follows.

Stage 2: Assigning Probabilities
In the second stage, we assign a probability to each hypernode in H, so that a hypernode with best score has the highest probability, and hypernodes with score ∞ (the worst possible score) have probability 0.
Weight Computation. Let γ ∈ R + be a user-specified parameter. γ is used to control the range between poorly scoring nodes versus top scoring nodes. As γ grows lower, hypernodes with higher (worse) scores are assigned probabilities much lower than hypernodes with lower scores.
We now turn to define the weight of a hypernode (u, x, i), which should stand for the (unconditional) probability that the scenario described by (u, x, i) happens. The definition is recursive. In the basis, where u is the root of G, we define w(u, x, i) (for any x ∈ V (S) and i ∈ {1, . . . , k}) as follows: if there exists an index j ∈ {1, . . . , k} such that (r, j) is derived from (u, x, i) (here, it means that they represent the same scenario), then w(u, x, i) = w(r, j); otherwise, w(u, x, i) = 0. Now, consider v that is not the root of G. We define w(v, y, i) (for any y ∈ V (S) and i ∈ {1, . . . , k}) as follows. First, let D(v, y, i) denote the collection of nodes (u, x, j) such that c(u, x, j) was derived from c(v, y, i)-in other words, the hypergraph has an hyperedge directed from (v, y, i) (and some other node) to (u, x, j). In particular, u is the parent of v in G, hence the weight w(u, x, j) is calculated before the weight w(v, y, i). Then, define w(v, y, i) = (u,x,j)∈D(v,y,i) w(u, x, j).
Proof of Lemma 2. We will verify a stronger property than the one in the statement of the lemma: For every vertex u in the Gene tree G, it holds that Before we verify this property, observe that when u is a leaf, then c(u, x, 1) = 0 for the unique vertex x that is compatible with u, and c(u, x, i) = ∞ (which means that D(u, x, i) = ∅ and hence w(u, x, i) = 0) for any other pair (x, i).
Thus, the stronger property implies the correctness of the weaker statement regarding leaves.
To prove the (stronger) property above, we use induction. In the basis, u is the root of the Gene tree G. Then, we have that x,i:(u,x,i)∈V (H) w(u, x, i) = i∈{1,2,...,k} w(r, i) = 1, and therefore the property holds. Now, suppose that u is not the root of G, and that the property holds for each of its ancestors.
Let v be the parent of u in H. Then, we have that Here, the first equality follows directly from the definition of weights. The second equality follows from the fact that each hypernode (v, y, i) (for any y and i) that has positive weight is derived from exactly one hypernode (u, x, j) (for some specific x and j). (However, each hypernode (u, x, j) can be used to derive several hypernodes (v, y, i).) The last equality follows from the inductive hypothesis. This completes the proof.

Stage 3.1: Pattern Discovery
The current version of RSAM-finder allows pattern queries to be specified as follows. A pattern specification consists of a tuple (EV, color, distance) where: 1. EV ⊆ {S, D, HT} specifies the evolutionary event of the pattern (S for speciation, D for duplication and HT for horizontal transfer).
2. color ∈ {red, black, None} specifies a color representing the environmental niche to which the sought RSAM confers adaptation.
3. distance ∈ {True, False} is a boolean indicator specifying whether or not to consider edge lengths (representing evolutionary distances) in the pattern specification.
For a colored query (having the second parameter in the specification set to red or back), the user can provide, as part of the input, a function colors : L(X) → Υ where X specifies whether the pattern refers to a subtree of S or a subtree of G, and Υ = {red, black}. Here, colors represent a binary environmental annotation of the leaves. Then, a preprocessing step is applied, in which the vertices of S and G are colored based on the colors assigned to the leaves of the subtree they root. We omit the technical details entailing the implementation of this preprocessing step to Section 1 of .
In addition to the settings described above, the user can select one of two modes: 1. Single-pattern mode. In this mode, the user specifies a single pattern and a threshold, and the sought RSAMs are identified as nodes u ∈ I(G) such that G u is enriched in the pattern, and |V (G u )| is bounded from below by the specified threshold.
2. Dual-pattern (contrasting) mode. In this mode, the user specifies two patterns and one threshold, and the sought RSAMs are identified as nodes u ∈ I(G) with children v, w ∈ V (G) such that G v is enriched with one pattern while G w is enriched with the other pattern. Here, the subtree size bound threshold refers to |V (G v )| and |V (G w )|.
The Pattern Identification algorithm proceeds as follows.
1. For each pattern P = (EV, color, distance) and for each hypernode (u, x, i) ∈ V (H), check whether both event(u, x, i) ∈ EV and the colors obey the requirements derived from the color field of the pattern specification (described in more details in Section 1 of Zoller et al. (2019)). If so, mark (u, x, i) as interesting.
2. Reflect the interesting nodes identified in H to G, by assigning corresponding weights to V (G). Each u ∈ I(G) is assigned a score, which is the sum of the probabilities of instances of the pattern found in G u , normalized by the number of possible events in G u . Additional book-keeping details regarding how this score is computed are given in Section 4.4.
Based on the specified mode of the query (single pattern or dual pattern), identify the t top scoring vertices u ∈ I(G). In case of a singlepattern mode, the scores are as defined in (2). In case of dual-pattern mode, let P 1 and P 2 be the patterns. For each u ∈ V (G) with children v, w ∈ V (G) the score of u is score of v for P 1 (as defined in (2)) plus the score of w for P 2 , and vice versa (that is, each vertex is assigned two scores).

Stage 3.2: Score Computation.
For each defined pattern P = (EV, color, distance), let counter P : V (G) → R + be a counter, initialized by 0. For each u ∈ I(G) ,let v and w be its right and left children, respectively. Let is marked as interesting with respect to P }.
That is, for each vertex x ∈ V (S) and i ∈ {1, . . . , k} such that (u, x, i) ∈ V (H) was marked as interesting in stage 1 with respect to pattern P , (u, x, i) ∈ I u . Let where w(u, x, i) are the probabilities assigned in Section 4.2. Intuitively, for each vertex u ∈ V (G) we calculate its probability to be interesting, with respect to the patterns we defined. In order to avoid a bias due to variation in the sizes of the subtrees rooted by the competitively estimated nodes in G, we normalise each value by the number of edges in the subtree rooted in the vertex times k, which is an upper bound on the number of possible patterns in all the solutions. That is, for each vertex u ∈ V (G) and pattern P , let counter P (u) = counter P (u) |E(Gu)|·k .

Experimental Results
We implemented the algorithm described in this paper as a tool, denoted RSAM-finder, and made it publicly available via GitHub (Zoller (2019)).
In this section we test and exemplify the performance of RSAM-finder. The tests are based on large scale simulations, where we demonstrate the engine's tolerance to noise (Subsection 5.2), and measure the practical running times of the proposed hypergraph construction algorithm as a function of increasing input size (Subsection 5.3). In Subsection 5.4 we exemplify an application of our proposed approach to the discovery and analysis of RSAMs in a Beta Lactamase gene. But first, in Subsection 5.1, we give the technical details regarding our simulations, tests and experiments.

Methods and Data Bases
Genes in our experiment are represented by their membership in Cluster of Orthologous Genes (Tatusov et al. (2000)). The STRING database (Szklarczyk et al. (2016)) was used to extract the chromosomal protein sequences for the COGs of interest, annotated with their corresponding species names as well as the corresponding NCBI IDs. Protein sequences were subjected to multiple sequence alignment and dendogram construction via Clustal Omega (Sievers and Higgins (2018)). The list of NCBI IDs was used as input for NCBI Taxamony Browser which provided a (non-binary) Species tree. Both Gene and Species trees were converted to binary trees via the Ape R package (Popescu et al. (2012)). Habitat labels for the species were extracted from PATRIC, and missing tags were manually annotated by information from the GOLD database (Mukherjee et al. (2016)) and from literature. CD Search (Marchler-Bauer and Bryant (2004)) was employed to seek statistically significant discriminating domain-level mutations (i.e. the gain or loss of a protein functional domain). The simulator and our algorithm were im-plemented in Python, using NetworkX package, DendroPy (Sukumaran and Holder (2010)) and ETE Toolkit (Huerta-Cepas et al. (2016)). Visualization of the trees and plots were created using Matpllotlib and Seaborn tools.
For the simulation-based experiments, we generated random binary trees. The generation of a random binary tree was done in a top down manner, using the ETE Toolkit (Huerta-Cepas et al. (2016)). We began with a given set of vertices, based on which we created a random binary tree. The tree was duplicated and one copy was denoted G, while the other was denoted S. The function σ : L(G) → L(S) was implemented as the matching between each leaf in G to its copy in S, and the function color : L(S) → {red, black} was implemented as a random binary function. To implant the pattern in the resulting random trees, we picked a random vertex u ∈ V (G), and modified the function σ : L(G) → L(S) for all vertices w ∈ L(G u ) in a way that created a Horizontal Transfer event. To this end, consider a vertex w ∈ V (G u ). Vertex w is made to represent a Horizontal Transfer event as follows. Let x ∈ V (S) be the copy of w in S. Let L(S x ) denote the copy of L(G w ) (the leaves of the subtree rooted in w) in the Species tree, thus the function σ maps each leaf of G w to its copy in the leaves of S x . Then, to create a Horizontal Transfer in w, we need to find a vertex y ∈ V (S) such that y and x are incomparable, and change the mapping of the leaves of G w to the leaves of S y randomly -that is, for each vertex r ∈ G w define σ(r) to be a random vertex z ∈ S y . This is likely to create a Horizontal Transfer in the DLT-reconciliation. Recall that in addition, we want to make those planted Horizontal Transfer events red-to-red events. To achieve this, we check to see if the random vertices u and y, which are the source and the target of the Horizontal Transfer, are "mostly red", as defined in Section 1 of . If they are not, we make another random choice and check the colors again. The query pattern ({HT}, red, True) was used in the simulation-based experiments. According to this pattern, we sought subtrees that are enriched in red-to-red Horizontal Transfer events. (For additional details, see Section 4.3.) The purple dots represent the planted vertex, and they obey the sought pattern. (B) Running times of the naive and efficient algorithms. Green triangles represents the efficient version and red circles represents the running times of the naive algorithm.

Testing for Noise Tolerance
We tested our tool on a random data set that was generated as described above, by introducing into the simulations an additional "noise factor" affecting Horizontal Transfers and colors. Each noise level represents the level of random changes in σ and random colors of the species. In particular, a noise level of 0% means that no changes were done to the mapping between the leaves of the Gene tree to the leaves of the Species tree except those of the planted pattern, and no change was made to the function color : L(S) → {red, black}, while a noise level of 100% means that the mappings of all of the vertices of the Gene tree were randomly picked, and all of the species colors were randomly picked again. Fig. 6(A) demonstrates the advantage of our approach across different noise levels, following the strategy described above to generate randomized phylogenetic Gene and Species trees with a planted pattern. First, we constructed random phylogenetic Species and Gene trees with 600 leaves and one planted pattern (marked in purple in Fig. 6(A)). For each noise level between 0% to 20% we constructed the corresponding hypergraph. For all experiments, we used k = 100, set the minimum size of a subtree to 0.1% of the number of all edges, and set c ∆ = c Θ = 1. Results for each noise level were computed as an average of 50 random choices for the same noise level, on the same input trees. The scores are as defined in Section 4.3.
We found that, at the lower noise levels, the score of the planted vertex u ∈ V (G) is higher than that of any other vertex, and this difference decreases as the noise level increases. Note that the additional noise increases the number and scores of false positives found. These findings support the claim that our method is able to find a pattern within noisy data.

Running Time Measurements
To demonstrate the practicality of the theoretical improvements presented in Section 4.1, we compared the running times of the efficient, O(mnk) time algorithm for hypergraph construction proposed in Section 4.1, versus the naive, O(mn(n+k)log(n+k)) time algorithm mentioned in the introduction.
The inputs to the compared algorithms were generated as follows: We picked random binary trees denoted S and G, with number of leaves ranging from 100 to 1000. For each number of leaves, we randomly created 10 such pairs of trees, and ran both the naive and the efficient algorithms on both datasets. Fig. 6(B) summarizes the measured time results. The green triangles correspond to the average of the time measured for the efficient version of the algorithm, and the red circles correspond to an average of the time measured for the naive version of the algorithm.
We found that as we increase the number of leaves, the differences in practical running times between the naive and the efficient algorithms become more significant. Furthermore, as expected in practice, the running time of the efficient algorithm is linear in the input size, while that of the naive one behaves as a non-linear function. The sought pattern is (({HT}, red, True), ({S, D, HT}, black, False)), which codes for two patterns, one of a massive Horizontal Transfer events from red to red (right subtree) and the other is all black events (left subtree). The figure shows the top-scoring subtree, and the corresponding sequences. The blue rectangle marks the mutation characterizing the sequences in the leaves of the right subtree: this insertion was identified as a BlaR (signal transducer) domain.

Example: RSAM Discovery in a Beta Lactamase.
Beta lactamases are versatile enzymes conferring resistance to the Beta lactam antibiotics, found in a diversity of bacterial sources. Their commonality is the ability to hydrolyze chemical compounds containing a Beta lactam ring (Bush (2018)). The secretion of antimicrobial compounds is an ancient mechanism with clear survival benefits for microbes competing with other microorganisms. Consequently, mechanisms that confer resistance are also ancient and may represent an underestimated reservoir in environmental bacteria (Bush (2018)). Antibiotic resistance factors, conferring adaptation to the pathogenesis environment, are widely spread by horizontal gene transfer mechanisms like conjugation, transformation and transduction (Navarro (2006); Poirel et al. (2009)). The persistent exposure of bacterial strains to a multitude of Beta lactams has induced dynamic and continuous pro-duction and mutation of Beta lactamases in these bacteria, expanding their activity even against the newly developed Beta lactam antibiotics (Stapleton et al. (2016)). Thus, an important objective is to identify mutations in Beta lactamase genes conferring adaptation to human and animal hosts.
Motivated by the above, we exemplify a microbiological application of RSAM-finder to the discovery of RSAMs in Beta Lactamase genes that confer adaptation to human and animal hosts. To this end, we use the pattern (({HT}, red, True), ({S, D, HT}, black, False)) to the discovery of RSAMs in Beta Lactamase genes that confer adaptation to human and animal hosts. Here, colors represent a binary environmental annotation: human and animal host (219 species) were annotated "red", while species associated with all other habitats (324 species), such as soil, water and plant, were annotated "black".
Among the known classes (A-D) of Beta lactamase, class D (represented by COG2602) is considered to be the most diverse (Evans and Amyes (2014)). Thus, we selected COG2602 (622 genes in 543 genomes) as the dataset for our example. Parameters were set as follows: k = 50, the minimum size required per sought subtree was set to 0.1 of the total number leaves of G, c ∆ = c Θ = 1 and c Σ = c Loss = 0. A figure displaying G, where the topranking RSAM node is marked with a star, is given in the supplementary materials. Also provided are the corresponding sequences, a figure displaying the corresponding S, and σ.
Within the top-ranking result for this query, we were interested in the subtree matching the first part of pattern (i.e. enrichment in red-to-red HT edges). The gene set represented by the leaves of this subtree (denoted "identified gene set") was found to be enriched in an additional domain, BlaR, a signal transducer membrane protein regulating Beta lactamase production (87/119 in the identified gene set versus 118/622 in the background, p-val = 3.94e-52). The only transcriptional regulator currently known for Beta lactamase genes is the repressor protein BlaI, previously predicted to operate in a two-component regulatory system together with BlaR in Class A Beta lactamase (Alksne and Rasmussen (1997)). The positions adjacent to the instances of the identified gene set in the corresponding genomes were found to be enriched in BlaI (70/119 of the identified gene set instances versus 90/622 of the background gene set instances, hypergeometric p-value = 1.11e-41). Note that this result was obtained with c Loss set to 0. When repeating the experiment with c Loss = 1, this result is still found among the the two top ranking vertices.
In contrast to the identified gene set, the genes represented by the subtree that matches the second part of the pattern (frequent black events of all types) are not enriched in the BlaR domain (2/36), nor is there contextual enrichment in BlaI (4/36) in positions immediately adjacent to instances of these genes. Applying RSAM-finder to this data with simpler queries that take into account only enrichment in environmental coloring does not yield this result, nor does the application of RSAM-finder to this data with any part of the pattern on its own.
The identified gene set for this result spans a wide range of Firmicutes, including both pathogenic (e.g. staphylococcus) and non-pathogenic species (e.g. various gut microbes from the Clostridiales order). Homology between BlaR receptor proteins and the extra-cellular domain of Class D Betalactamases was previously observed (Massidda et al. (1996); Brandt et al. (2017)), mainly in gram-negative bacteria (with focus on clinical samples). Thus, RSAM-finder identifies a putative Beta lactamase system in gram positive bacteria, consisting of a COG2602-BlaR Beta lactamase-receptor protein and its BlaI family repressor, predicted to confer adaptation to animal and human host environment. Further comparative sequence-level analysis (Toth et al. (2016)) may reveal the affinity of this Beta lactamase system to specific Beta lactam drugs.

Conclusions
We defined a new optimization problem in the DLT reconciliation domain. The input to this problem consists of a gene tree, constructed for a given gene orthology group, a species tree constructed for the species harboring one or more members of this gene orthology group, and a pattern representing a sought scenario in the reconciliation of the two trees. The sought pattern could imply some evolutionary process of interest, such as e.g. a gene conferring adaptation of the species to a specific environmental niche. The goal of the problem is to compute, for any vertex in the gene tree, a score reflecting the probability that the genomic mutations associated with the edge leading into this vertex confer the occurrence of the sought pattern within high-scoring reconciliations of the subtree rooted by this vertex with corresponding subtrees in the species trees.
To solve this new problem, and overcome some of the noise associated with gene tree and species tree reconstruction, we proposed an algorithm that first constructs a hypergraph H that stores information regarding the k-best DLT reconciliation scenarios for a given problem instance. The time complexity of the algorithm we propose for the construction of this hypergraph is O(m·n·k), which is essentially optimal since the number of vertices (and hence also the size) of the hypergraph can be as large as Ω(m · n · k).
Interesting open problems include the goal of extending the tool to handle more robust variations of phylogenies, such as polytomies and phylogenetic networks. It may also be helpful to consider bootstrapping methods to train the parameters and thresholds utilized by RSAM-Finder.

Acknowledgments
This work was supported by ISF grants no. 1176/18 and no. 939/18 and by the Lynn and William Frankel Center for Computer Science.

Disclosure Statement
No competing financial interests exist.