Abstract 1 Introduction 2 Preliminaries 3 Simple Pairwise Distance Computation using Suffix Tree 4 Our Framework for Shortest Undirected Paths in de Bruijn Graphs 5 Application: Connecting de Bruijn Graphs Efficiently References

Shortest Undirected Paths in de Bruijn Graphs

Wiktor Zuba ORCID CWI, Amsterdam, The Netherlands
University of Warsaw, Poland
Oded Lachish ORCID Birkbeck, University of London, UK Solon P. Pissis ORCID CWI, Amsterdam, The Netherlands
Vrije Universiteit, Amsterdam, The Netherlands
Abstract

Computing shortest directed paths in de Bruijn graphs is well studied and well understood. This is not the case for computing undirected paths, which is much more challenging algorithmically. In this paper, we present a general framework for computing shortest undirected paths in arbitrary de Bruijn graphs, that is, arbitrary subgraphs of the complete de Bruijn graph. We then present an application of our techniques for making any arbitrary order-k de Bruijn graph G(V,E) weakly connected by adding a set of edges of minimum total cost. This improves the running time of the recent (22/d)-approximation algorithm by Bernardini et al. [CPM 2024] from 𝒪(k|V|2) to 𝒪(k|V|logd) time, where d is the number of weakly connected components of graph G.

Keywords and phrases:
string algorithm, graph algorithm, de Bruijn graph, Eulerian graph
Funding:
Wiktor Zuba: Supported in part by the Netherlands Organisation for Scientific Research (NWO) through Gravitation-grant NETWORKS that has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 101034253.
Solon P. Pissis: Supported in part by the PANGAIA and ALPACA projects that have received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No 872539 and 956229, respectively.
Copyright and License:
[Uncaptioned image] © Wiktor Zuba, Oded Lachish, and Solon P. Pissis; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Pattern matching
Funding:
A research visit during which part of the presented ideas were conceived was funded by a Royal Society International Exchanges Award.
Editors:
Paola Bonizzoni and Veli Mäkinen

1 Introduction

We start with some basic definitions and notation from [5]. An alphabet Σ is a finite set of elements called letters. We consider an integer alphabet Σ=[0,σ). Let X=X[0]X[n1] be a string of length n=|X| over Σ. By Σk we denote the set of all strings of length k>0. For two indices i and ji of X, X[i..j] is the fragment of X starting at position i and ending at position j. The fragment X[i..j] is an occurrence of the underlying substring P=X[i]X[j]; we say that P occurs at position i in X. A prefix of X is a substring of the form X[0..j] and a suffix of X is a substring of the form X[i..n1]. By XY or XY we denote the concatenation of strings X and Y: XY=X[0]X[|X|1]Y[0]Y[|Y|1].

Let us fix a collection 𝒮 of strings over alphabet Σ. We define the order-k de Bruijn graph (dBG, in short) of 𝒮 as a directed multigraph, denoted by G𝒮,k(V,E), where V is the set of length-k substrings of the strings in 𝒮 and E has an edge (u,v) with multiplicity mu,v if and only if the strings u[0]v and uv[k1] are equal and occurring exactly mu,v times in total in the strings of collection 𝒮. In bioinformatics, 𝒮 models a collection of DNA sequences coming from a genome sample through a sequencing experiment, and any Eulerian trail of G𝒮,k(V,E) – a graph walk using each edge of G𝒮,k(V,E) exactly once – represents a potential reconstruction of the genome [21, 18]. Inspect Figure 1. This is an idealized model though, since G𝒮,k would never be Eulerian in practice due to sequencing errors [19]; and, furthermore, G𝒮,k would not even be weakly connected. We could make G𝒮,k Eulerian by increasing the multiplicity of some of its existing edges [17] or introducing new ones [5]. In either case, the natural optimization goal is to minimize the total cost (number) of the added edges. In fact, many algorithms underlying genome assembly tackle similar problems [24, 1, 9, 23].

We also define the complete de Bruijn graph of order k over the alphabet Σ, denoted by GΣ,k(V,E), as a directed graph, where V is the set of all the strings from Σk and E contains an edge (u,v) if and only if string v is obtained from string u by appending letter v[k1] after its last position and removing letter u[0]: GΣ,k(V,E) has |Σ|k nodes and |Σ|k+1 edges.

Figure 1: Let 𝒮 be a string collection consisting of strings AAT, ATA, ATT, ATT, TAA, TAT, TTA, TTA. Note that, for example, string ATT has multiplicity 2 in 𝒮. The de Bruijn graph G𝒮,2(V,E) is presented on the left; the complete set of node-distinct Eulerian trails from s to t are presented in the middle – we call two trails node-distinct if their node sequences are different in accordance with the existing literature [3, 10]. Node-distinct Eulerian trails correspond to distinct strings; the corresponding set of possible string reconstructions is presented on the right.

Our Motivation.

Bernardini et al. [4] studied the problem of making any arbitrary G𝒮,k weakly connected by introducing a set of new edges of minimum total cost (as well as the underlying set of new nodes when those do not exist in G𝒮,k); by arbitrary G𝒮,k, we mean an arbitrary subgraph of the complete dBG GΣ,k. Note that making any arbitrary graph weakly connected by introducing a set of new edges of minimum total cost is trivial: if the graph consists of d components, the solution is to greedily add d1 edges to connect the components. The algorithmic challenge with dBGs follows by their definition: not every pair of nodes can be connected via a single edge. Solving this problem is important because we can then use the linear-time algorithm of Bernardini et al. from [5] to balance the weakly connected graph by adding a set of edges of minimum total cost thus making G𝒮,k Eulerian.111By Euler’s famous theorem, we know that a weakly connected directed graph is Eulerian if and only if every graph node has equal in-degree and out-degree (except perhaps for the source and target). Recall that making G𝒮,k directly Eulerian by adding a set of new edges of minimum total cost is NP-hard from the well-known shortest common superstring problem [12]; hence, the connect-and-balance strategy is reasonable and generally serves as a good-performing heuristic [5]. Two remarks are in order: first, since the general task is to connect G𝒮,k, we can safely assume that mu,v is either 0 or 1 (i.e., multiplicities play no role here); and second, since the task, in particular, is to connect G𝒮,k with the smallest number of edge additions, we should rather seek shortest undirected paths in G𝒮,k (i.e., shortest sequences of edges from GΣ,k, in which the edge directions are neglected). Inspect Figure 2.

Figure 2: A de Bruijn graph of order k=4 with 3 weakly connected components (left); an optimal solution, using shortest undirected paths, with cost 3 (middle); a feasible solution, using shortest directed paths, with cost 8 (right). We color blue the edges and nodes that we add from GΣ,k.

Bernardini et al. [4] showed that making G𝒮,k weakly connected by adding a set of edges of minimum total cost is NP-hard. They also showed that no polynomial-time approximation scheme (PTAS) exists for making G𝒮,k weakly connected by adding a set of edges of minimum total cost (unless the unique games conjecture [13] fails). Finally, they also showed that there exists an 𝒪(k|V|2)-time (22/d)-approximation algorithm for the same problem, where d is the number of connected components of G𝒮,k. In this paper, we introduce a general framework for finding shortest undirected paths in dBGs. In particular, using our framework, we improve the running time of the approximation algorithm of Bernardini et al. from 𝒪(k|V|2) to 𝒪(k|V|logd), while maintaining the same approximation ratio.

Our Framework.

Let us fix d families C1,C2,,Cd of nodes (forming connected components in most applications) from GΣ,k, and let us denote V=C1C2Cd.222We assume that these families are pairwise disjoint. However, our solutions work without this assumption. Throughout this paper, we treat nodes of dBGs and their length-k string representations as equivalent: indeed, nodes of GΣ,k are in a natural bijection with Σk. In particular, Cp, for any p[1,d], and V are also treated as sets of strings.

We generally aim at obtaining efficient algorithms for finding the minimal distance between the nodes from two different families Cp and Cq; more formally, for any p,q[1,d],

dist(p,q)=min{dist(u,v):uCp,vCq},

where dist(u,v) is the length of a shortest undirected path from node u to node v in GΣ,k.

Note that if Cp and Cq form two weakly connected components of GΣ,k, then dist(p,q) is precisely the minimum number of edges that must be added to connect the two components into a single one – assuming that we also add the nodes implied by those edges. In the same manner, by adding d1 such undirected paths, we can (greedily) connect C1,C2,,Cd to a single component, thus making any arbitrary dBG G𝒮,k weakly connected.

The problem of finding the shortest directed path between any two nodes of GΣ,k can be solved in the optimal 𝒪(k) time using the preprocessing of the classic KMP algorithm [14]. The same problem for undirected paths can also be solved in the optimal 𝒪(k) time [16]. By iteratively applying the latter result, we can compute dist(p,q) in 𝒪(k|Cp||Cq|) time and for all p,q pairs in 𝒪(k|V|2) total time, which is very slow when |V| is large, even if the number d of families is relatively small. Using more refined techniques, based on the generalized suffix tree [25] of the strings from V, we develop algorithms for finding the distances much more efficiently when the families are large or when there are many of them.

In particular, given the collection C1,C2,,Cd, we consider the following problems:

  • One-to-One(p,q): output dist(p,q). Here we are given, in addition, p and q, and we are asked to find the length of a shortest undirected path between any node of Cp and any node of Cq.

  • One-to-All(p): output dist(p,q), for every q[1,d]. Here we are given, in addition, p, and we are asked to find the length of a shortest undirected path between any node of Cp and any node of Cq, for every q[1,d].

  • All-to-All: output dist(p,q), for all p,q[1,d]. Here we are asked to find the length of a shortest undirected path between any node of Cp and any node of Cq, for all p,q[1,d].

  • Top(r): Here we are given, in addition, r, and we are asked to output, for every q[1,d], r distinct p[1,d] with the smallest value of dist(p,q), breaking ties arbitrarily.

Let us remark that the algorithms underlying our framework are constructive – whenever we compute dist(p,q), we also know a pair uCp,vCq of nodes that are at distance dist(p,q). By applying the technique from [16], we can enhance the output of the above algorithms with optimal paths realising those distances at the additional linear cost in the size of the output – k times the length of a shortest path (every node is explicitly encoded using k letters). If the output is given in a compacted form – the differences between two consecutive nodes on the path (the new letter introduced and whether it is put in the front or back) – then the additional cost reduces to the length of the shortest path.

Our Results.

We make the following specific contributions:

  • an algorithm solving One-to-One(p,q) in 𝒪(k(|Cp|+|Cq|)) time and space;

  • an algorithm solving One-to-All(p) in 𝒪(k|V|) time and space;

  • an algorithm solving All-to-All in 𝒪(dk|V|) time and 𝒪(k|V|) space;

  • an algorithm solving Top(r) in 𝒪(rk|V|) time and space.

Application.

By plugging our results directly in the approximation algorithm of Bernardini et al. [4], we improve the running time of their algorithm for connecting dBGs from 𝒪(k|V|2) to 𝒪(dk|V|). Using more refined techniques, we obtain an even further improvement; in particular, we design an 𝒪(k|V|logd)-time algorithm for the same task, while maintaining the same approximation ratio of (22/d) [4].

Paper Organization.

In Section 2, we present some preliminaries that essentially summarize the work in [16]. In Section 3, we present a simple linear-time algorithm for computing shortest paths in undirected dBGs. In Section 4, we present our framework: how distances between sets of nodes can be computed more efficiently in many settings. In Section 5, we apply our framework to the problem of making any arbitrary dBG weakly connected [4].

2 Preliminaries

Let [k] denote the set {0,1,,k1}, and let S[i..j]=S[i..j+1) denote the substring S[i]S[i+1]S[j] of S. Let S1,S2Σk represent nodes v1 and v2 (respectively) of GΣ,k.

Let U be a common substring of S1 and S2 and assume that it occurs in those strings at positions i and j respectively, with ij. Notice that we can transform S1 into S2 by first removing the first i letters of S1 (appending i arbitrary letters at its end at the same time), then adding the first j letters of S2 to its front (removing j letters from its end – in particular all the letters added in the previous step), and then symmetrically remove the last kj|U| letters and add the length-(kj|U|) suffix of S2 in their place. In total, this requires i+j+2(kj|U|)=2k2|U|(ji) operations, and each such operation corresponds to an edge of the complete dBG. It turns out that one cannot get a path from v1 to v2 of shorter length other than by choosing a different common substring or different occurrences. This fact is summarized in Lemma 1 by Liu [16]. Inspect Figure 3.

Lemma 1 ([16]).

Let v1,v2 be two nodes of GΣ,k and S1,S2Σk be the string representations of v1,v2, respectively. Then dist(v1,v2)=2kmaxi,j[k](2|Ui,j|+|ji|), where Ui,j is the longest common prefix of S1[i..k1] and S2[j..k1].

Figure 3: Consider the strings S1 and S2 for k=21. The distance from S1 to S2 in the directed de Bruijn graph is equal to 19 (CT is the longest suffix of S1 that is a prefix of S2), and the distance from S2 to S1 is equal to 18 (AAA is the longest suffix of S2 that is a prefix of S1). In the undirected case, the distance between the two nodes is 221297=17 witnessed by the common substring C4TC4. Notice that even though C10 is a longer common substring it cannot be used to obtain a shortest path between the nodes because it appears in S1 at position 3 and in S2 at position 7 (and 210+4<29+7). This example shows that it is not enough to look at prefixes, suffixes or the longest common substring when computing the distance in the undirected de Bruijn graph.

Suffix Tree.

The classic indexing solution for standard strings is the suffix tree [25]. Given a set of strings, the compacted trie of these strings is the trie obtained by compressing each path of nodes of degree one in the trie of the strings in . The dissolved nodes are called implicit while the preserved nodes are called explicit. For any node u (implicit of explicit) of the compacted trie, str(u) denotes the string spelled from the root of the trie to u; the string depth of u is then defined as d(u)=|str(u)|. Each edge in the compacted trie has a label represented as a fragment of a string in . Thus, the compacted trie uses 𝒪(||) space [20]. The suffix tree ST(S) is the compacted trie of the suffixes of string S. Assuming S ends with a unique terminating symbol $, every suffix S[i..|S|) of S is represented by a leaf annotated with index i (the starting position of the suffix in S). The root node, the branching nodes, and the leaf nodes form the set of explicit nodes of ST(S); inspect Figure 4. The suffix tree occupies 𝒪(|S|) space and it can be constructed in 𝒪(|S|) time [25, 11]. The edges of ST(S) can be accessed in 𝒪(1) time if stored using perfect hashing [2]. The suffix tree then supports pattern matching queries for any pattern P of length m in 𝒪(m+|Occ|) time, where Occ is the set of output occurrences of P in S. The suffix tree can also be generalized to a collection 𝒮={S1,,SN} of N strings as the compacted trie ST(𝒮) of the suffixes of string S1$1SN$N, where $iΣ, for i[1,N], are distinct terminating symbols.

Figure 4: The suffix tree ST(S) of string S=CAGAGA$. The node representing string AG is implicit and thus it is not stored explicitly in this compacted trie. The node u representing string AGA is explicit (it is a branching node) and thus it is stored; we have str(u)=AGA and d(u)=3=|AGA|.

3 Simple Pairwise Distance Computation using Suffix Tree

Let I1(U)={i:S1[i..i+|U|)=U} and let I2(U)={j:S2[j..j+|U|)=U}. The sets I1(U) and I2(U) represent the occurrences of string U in strings S1 and S2, respectively. The distance dist(v1,v2) as described in Lemma 1 can also be expressed as follows: 333If I1(U)= or I2(U)=, then the distance for this U as witness is equal to , and hence the distance remains the same whether such U are considered or not, thus the abstract formula can iterate over Σ.

dist(v1,v2)=2kmaxU(2|U|+max[max(I1(U))min(I2(U)),max(I2(U))min(I1(U))]). (1)

Equation 1 changes the order of the maxima: the outer one is over any substring U, and the inner one is over the occurrences (starting positions) of U. We will next view Equation 1 through the lens of the generalized suffix tree of S1 and S2.

Let ST({S1,S2}) be the generalized suffix tree of S1 and S2. Since |S1|=|S2|=k, ST({S1,S2}) has 𝒪(k) explicit nodes (and edges). For an explicit node v of ST({S1,S2}), let d(v) be its string depth (the length of the substring represented), Lvx=min{i:Sx[i..k1] is a leaf descendant of v}, and Rvx=max{i:Sx[i..k1] is a leaf descendant of v} for x{1,2}.

Each common substring U is represented by an explicit (or implicit) node of ST({S1,S2}). Notice, that if both U and U=Ua for some letter aΣ appear at the very same positions (both in S1 and S2), then the value of the formula for U is larger than the one for U by exactly 2, hence it is enough to focus on the explicit nodes of ST({S1,S2}) when we want to compute the optimal value. Hence Equation 1 can also be expressed as follows:

dist(v1,v2)=2kmaxvST({S1,S2})(2d(v)+max[Rv1Lv2,Rv2Lv1]). (2)
Observation 2.

Lvx=minwChildren(v)Lwx (respectively Rvx=maxwChildren(v)Rwx) for a branching node v, where Children(v) is the set of direct descendants of v in the suffix tree. For a leaf v, the set from which Lvx (resp. Rvx) is taken is either a singleton or an empty set.

A direct consequence of Observation 2 is that the values Lv1,Rv1,Lv2, and Rv2 can be computed bottom up. We thus compute these 4 values using a bottom-up traversal (and the depth d(v) via a top-down traversal) for each node in 𝒪(k) total time. This gives a simple 𝒪(k)-time algorithm for computing dist(u1,u2) – after constructing the suffix tree and computing those values, it suffices to find the optimal value of Equation 2 over all the explicit nodes.

Lemma 3.

Let v1,v2 be two nodes of GΣ,k and S1,S2Σk be the string representations of v1,v2, respectively. We can compute dist(v1,v2) in 𝒪(k) time using ST({S1,S2}).

We note that a different, yet much more complicated, 𝒪(k)-time algorithm for computing dist(v1,v2) using suffix trees has already been given by Liu in [16].

Recall that dist(p,q)=minv1Cp,v2Cqdist(v1,v2). A naïve application of Lemma 3 for computing dist(p,q) by explicitly computing the pairwise distance between all the pairs of nodes runs in 𝒪(k|Cp||Cq|) time. In the next section, we present our framework: how distances between sets of nodes can be computed more efficiently in many settings.

4 Our Framework for Shortest Undirected Paths in de Bruijn Graphs

In this section, we provide simple and efficient solutions to the considered distance (shortest-path) problems. Our solutions rely only on the generalized suffix tree of the strings in question (V or CpCq) and standard operations (graph traversals and dynamic programming) and hence are not only of theoretical interest, but should also admit efficient implementations.

4.1 One-to-One

Recall that in One-to-One, we are given C1,C2,,Cd, p and q, and we are asked to find the length of a shortest undirected path between any node of Cp and any node of Cq.

Note that in Equation 1, it does not really matter from which string in Cp (resp. Cq) the occurrence of U at position i (resp. j) originates: by changing the order of the minima in the formula, we have that dist(p,q) can be expressed by Equation 1 if we naturally extend the set Ix(U)={i:SCxS[i..i+|U|)=U}, for x{p,q}. In particular, with the use of Lvx,Rvx, for x{p,q}, based on the suffixes of all the strings representing Cx, Equation 2 generalizes to the following one:

dist(p,q)=2kmaxvST(CpCq)(2d(v)+max[RvpLvq,RvqLvp]). (3)

Inspect Figure 5.

Figure 5: The information [d(v),Lvp,Rvp] computed for the explicit non-leaf nodes v of ST(CpCq) restricted to the part of the suffix tree where the only edge going out of the root is labeled by letter A. Let UCq, V,WCp, and k=13, for U=CBDCCCABCABCE, V=CDABCABCDEFGB, and W=BABCDEABCDEFG. The labels on the edges leading to leaves are compacted for simplicity (and represented only with the suffixes X1,,X4). By additionally computing Luq=6 and Ruq=9, for the explicit node u representing ABC, we get the distance 21323max(91,66)=12 witnessed by the common substring ABC. By comparing the distances witnessed by all the common substrings (nodes of the suffix tree), we obtain the minimal distance 10 witnessed by ABCABC.

Since the size of ST(CpCq) is in 𝒪(k(|Cp|+|Cq|)), and it can be constructed in linear time, by repeating the same operations as for the computation of dist(v1,v2), we obtain:

Theorem 4.

We can solve One-to-One in 𝒪(k(|Cp|+|Cq|)) time and space.

4.2 One-to-All

Recall that in One-to-All(p), we are given C1,C2,,Cd and p, and we are asked to find the length of a shortest undirected path between any node of Cp and any node of Cq, for every q[1,d]. Using Theorem 4, we directly obtain a solution to One-to-All(p) and to All-to-All running in 𝒪(k(|V|+d|Cp|)) and 𝒪(dk|V|) time, respectively, using 𝒪(kmaxq|Cq|) space. We now proceed to a more refined processing of the suffix tree that allows us to obtain a more efficient algorithm for One-to-All, which is later used to solve Top(r). Our high-level goal is to reduce the number of nodes over which we must optimize Equation 3.

Let A(v) denote the set of all ancestors of v in ST(V) (including the node itself). Further let Minvx=maxwA(v)[2d(w)Lwx], and Maxvx=maxwA(v)[2d(w)+Rwx], for x[1,d]. Recall that V=C1C2Cd. Recall also that as noted in Observation 2 for the leaf nodes of ST(V), denoted by Leaves(ST(V)), the sets Ix(U) are either equal to a singleton {ivx} (when S[ivx..k1]=U for SCx is represented by node v) or empty. 444Note that for a leaf v, d(v)=kivx; that is, the label $ is not taken into account in computation of the length of the common substring represented – this plays a role only when computing dist(p,p) anyway. In particular ivx exists only for a single x (as symbols $x are distinct). We can thus define Equation 4 as a more refined version of Equation 3:

dist(p,q)=2kmaxvLeaves(ST(V))(max[Maxvpivq,Minvp+ivq]). (4)
Example 5.

Consider the example from Figure 5: U=CBDCCCABCABCECq, V=CDABCABCDEFGBCp, and W=BABCDEABCDEFGCp. Consider the leaf nodes representing the strings V[2..12] and U[6..12]. Both leaf nodes have the following ancestors (other than themselves) and the following [d,Lvp,Rvp,Lvq,Rvq] values:

  • ABCABC: [6,2,2,6,6];

  • ABC: [3,1,6,6,9];

  • empty string ε (root node): [0,0,12,0,12].

Let v be the leaf node representing U[6..12]. We iterate over all ancestors w of v:

  • Minvp=maxwA(v)[2d(w)Lwp]=max{,10,5,0}=10;

  • Maxvp=maxwA(v)[2d(w)+Rwp]=max{,14,12,12}=14.

We have 2kmax[Maxvpivq,Minvp+ivq]=26(10+6)=10, which gives us the minimal distance, between Cq and Cp witnessed by ABCABC.

The following lemma is crucial for the correctness of our approach.

Lemma 6.

Equations 3 and 4 are equivalent.

Proof.

Let v be the node of ST for which Equation 3 attains the optimal value, which w.l.o.g. is equal to 2kmax(2d(v)+RvqLvp), and let u,w be the leaf descendants of v for which Lvp=iup and Rvq=iwq.

Since v is an ancestor of w, we know that Minwp2d(v)iup, hence Minwp+iwq2d(v)iup+iwq=2d(v)Lvp+Rvq, thus the value of Equation 4 (witnessed by node w) is at least as large as the value of Equation 3 (witnessed by the node v).

For the converse inequality, w.l.o.g. the value of Equation 4 is equal to 2k[Minwp+iwq] for a leaf node w. By the definition of Minwp there exists an ancestor v of w such that Minwp=2d(v)Lvp. Hence 2d(v)Lvp+RvqMinwp+iwq (as Rvqiwq,), which shows that the value of Equation 3 cannot be smaller than the value of Equation 4.

Observation 7.

Minvp=max(MinParent(v)p,2d(v)Lvp).555The equation for Maxvp is analogous: max(MaxParent(v)p,2d(v)+Rvp).

A direct consequence of Observation 7 is that the values Minvp and Maxvp can be computed top down. By computing these 2 values using a top-down traversal in 𝒪(k|V|) total time and space for all explicit nodes and computing Equation 4 for the leaves, we obtain:

Theorem 8.

We can solve One-to-All in 𝒪(k|V|) time and space.

By directly computing the values Minvp and Maxvp, for all p[1,d], we obtain another algorithm to solve All-to-All in 𝒪(dk|V|) time. For this algorithm, the space used is 𝒪(dk|V|). By simply using Theorem 8 d times, the required space is reduced to 𝒪(k|V|). The computation of all the values in a single run over the generalized suffix tree has other nice properties, however – as shown in the next section it allows to restrict the output while also reducing the computation time and space.

4.3 Top

Recall that in Top(r), we are given C1,C2,,Cd and an integer r, and we are asked to output, for every q[1,d], r distinct p[1,d] with the smallest value of dist(p,q), breaking ties arbitrarily.

We start with the following simple yet crucial observation: If for a fixed leaf v of ST(V), the values of Minvp and Maxvp over p are ordered nonincreasingly, it suffices to know the first r of those for each type (Min and Max), that is, we do not need to compute all the 2d values.

Recall that the Minvp and Maxvp values are computed by first computing the values Lwp and Rwp bottom up and then computing the values Minwp and Maxwp top down for every node w of ST(V). If we store the best (smallest for L and largest for Min,Max and R) r values (over p[1,d]) for each of those 4 types, the values for the parent/children can be computed in time proportional to the number of nodes multiplied by r. Indeed when computing the smallest (up to) r values of Lwp, it suffices to find the r smallest elements of the values stored in the children; hence that can be computed in 𝒪(rc) time, where c is the number of children of w – this sums up to 𝒪(rk|V|) total time over all explicit nodes of ST(V). Here we use 𝒪(rc) time to exclude the duplicate values p[1,d] – a check if this set Cp is already represented can be done using an extra integer array of size d (only one array for the whole computation) with 𝒪(1)-time updates.

After the computation of Lwp (resp. Rwp) for all the nodes, the r largest values Minwp (resp. Maxwp) can be obtained using Observation 7 from the values stored in the parent node, and the r values 2d(w)Lwp (resp. 2d(w)+Rwp) – which are already sorted nonincreasingly due to the sorted order on Lwp (resp. Rwp).

Finally, we once again simply iterate over the leaves of ST(V), and gather the r smallest values of dist(p,q) over all the leaves representing a suffix of a string from Cq (using a bucket queue). We obtain the following result.

Theorem 9.

Top(r) can be computed in 𝒪(rk|V|) time and space.

5 Application: Connecting de Bruijn Graphs Efficiently

While our framework may have other applications revolving around dBGs (e.g., [6, 7]), in this section, we showcase the application of making an arbitrary dBG weakly connected.

Let us fix an arbitrary dBG of order k consisting of d weakly connected components and also denote it by G(V,E). Bernardini et al. [4] proved the following result.

Theorem 10 ([4]).

For any order-k dBG G(V,E) consisting of d weakly connected components, there exists an 𝒪(k|V|2)-time (22/d)-approximation algorithm for making G weakly connected by adding a set of edges of minimum total cost.

We improve Theorem 10 by slashing a factor of |V|/logd from the running time. Let G(V,E) be the graph obtained from the complete dBG GΣ,k by collapsing each component Cp, p[1,d], of G(V,E) into one super-node. The solution underlying Theorem 10 consists of the following three steps:

  1. (i)

    Construct the metric closure666Recall that the metric closure of a graph G is the complete graph in which each edge is weighted by the shortest path distance between the nodes in G. of G.

  2. (ii)

    Compute a minimum spanning tree of the metric closure.

  3. (iii)

    Convert the minimum spanning tree into a set of nodes and a set of edges to be added to G to make it weakly connected.

The correctness follows directly from the fact that a minimum spanning tree for the metric closure of G is a (22/d)-approximation for the minimum Steiner tree [15],777Recall that the minimum Steiner tree problem asks, given a graph G with nonnegative edge weights and a subset of terminal nodes, to compute a tree of minimum weight that contains all terminals. where d is the number of terminals and thus the number of weakly connected components of G. Step (i) requires 𝒪(k|V|2) time by applying Lemma 3. Step (ii) can be done in 𝒪(d2) time by applying, e.g., Prim’s algorithm [22]. Finally, Step (iii) can be done by applying again Lemma 3 to compute a shortest undirected path. This sums up to 𝒪(k|V|2) total time.

To complement Theorem 10, Bernardini et al. also showed that making G(V,E) weakly connected by adding a set of edges of minimum total cost is NP-hard and admits no PTAS.

Theorem 10 can be improved using our framework: Theorem 4 directly outputs the weights of the edges of the metric closure G in 𝒪(dk|V|) time; this improves the running time from 𝒪(k|V|2) to 𝒪(dk|V|) time. In particular, with the use of the Top queries, we can obtain an even more efficient solution by removing the construction of the metric closure and instead computing its spanning tree directly from our input.

Let 𝒢(𝒱,) be an arbitrary undirected weighted graph. Further, let ETop be a subset of defined by choosing, for each node v𝒱, an edge incident with v with the smallest weight (one of such edges in case of ties), and then removing, from each cycle obtained this way, the heaviest edge (breaking ties arbitrarily). We next state the following well-known lemma (cf. [8]), and also prove it here for completeness.

Lemma 11.

ETop is a subset of a minimum spanning tree of 𝒢(𝒱,).

Proof.

We show that starting from any arbitrary spanning tree of 𝒢, we can modify it using a greedy approach so that the obtained spanning tree contains all edges of ETop, and has a total weight at most as large as the weight of the initial spanning tree.

We iterate over the edges of ETop; we add the edge to the spanning tree and then remove one edge from the newly created cycle: the one with the largest weight and, among possibly many such edges, we prefer an edge that does not belong to ETop.

Clearly, such an operation cannot increase the weight of the spanning tree – the worst-case scenario is that the newly added edge is immediately removed. If by applying this procedure, we never remove any edge from ETop, then the claimed solution exists.

Towards the contradiction assume that after adding an edge e1ETop, we create a cycle in which all the heaviest edges belong to ETop. Take one such edge – it was chosen for a node v. The other edge of the cycle incident with v cannot be lighter (by the definition of ETop), hence by assumption it must have the same weight and hence must also belong to ETop (by the assumption of this paragraph of the proof); we move on to the node for which this edge was chosen, and continue in the same way. Since the cycle is finite, at some point we have to come back to v – but this means that the edges from ETop formed a cycle – a contradiction with the definition of ETop. Thus, ETop is a subset of a minimum spanning tree of 𝒢(𝒱,).

Recall that G(V,E) is the graph obtained from the complete dBG GΣ,k by collapsing each component Cp, p[1,d], of G(V,E) into one super-node. We show the following lemma (implementing the algorithm of [8]).

Lemma 12.

We can find a minimum spanning tree of G(V,E) in 𝒪(k|V|logd) time using 𝒪(k|V|) space.

Proof.

Let us remark that we do not explicitly construct G or GΣ,k.

We start by producing a set of edges ETop for G using a Top(r) query for r=2. Such a query returns for each component Cq of G, that is, equivalently for each super-node q of G, two super-nodes closest to it. In particular, what is implied by this is that even if one of those two is q itself (which happens in most cases), the other one must be different. By Theorem 9, in 𝒪(k|V|) total time, we find, for each super-node of G, one incident edge with the smallest weight possible – by taking this set of edges and removing from each cycle a single edge we obtain a valid set of edges ETop.

By Lemma 11, we can safely report this set of edges as part of the output. We can also contract the super-nodes of G connected with the edges from ETop into other single super-nodes obtaining graph G′′. Now every spanning tree of G that contains ETop is equivalent to the union of a spanning tree of G′′ and the set ETop, hence the problem reduces to finding the minimum spanning tree of G′′.

Notice that G′′ is represented by the very same input to our original problem on a dBG, just with some of the sets Cp merged together, and so we can use the same generalized suffix tree just with different labels p,q. We can thus repeat the same approach until we reach a graph with a single super-node. Note that each such iteration takes 𝒪(k|V|) time (Theorem 9), and that there can be no more than log2d such iterations because each time every super-node gets connected to another one, the number of super-nodes (components) drop by at least the factor 2 – the statement follows.

By applying Lemma 12 to the solution from [4] we obtain the following improved result.

Theorem 13.

For any order-k dBG G(V,E) consisting of d weakly connected components, there exists an 𝒪(k|V|logd)-time (22/d)-approximation algorithm for making G weakly connected by adding a set of edges of minimum total cost.

Since the algorithm underlying Theorem 13 is near-optimal, the main open question is whether we can improve the approximation ratio.

References

  • [1] Nidia Obscura Acosta, Veli Mäkinen, and Alexandru I. Tomescu. A safe and complete algorithm for metagenomic assembly. Algorithms Mol. Biol., 13(1):3:1–3:12, 2018. doi:10.1186/S13015-018-0122-7.
  • [2] Michael A. Bender, Alex Conway, Martin Farach-Colton, William Kuszmaul, and Guido Tagliavini. Iceberg hashing: Optimizing many hash-table criteria at once. J. ACM, 70(6):40:1–40:51, 2023. doi:10.1145/3625817.
  • [3] Giulia Bernardini, Huiping Chen, Gabriele Fici, Grigorios Loukides, and Solon P. Pissis. Reverse-safe text indexing. ACM J. Exp. Algorithmics, 26:1.10:1–1.10:26, 2021. doi:10.1145/3461698.
  • [4] Giulia Bernardini, Huiping Chen, Inge Li Gørtz, Christoffer Krogh, Grigorios Loukides, Solon P. Pissis, Leen Stougie, and Michelle Sweering. Connecting de Bruijn Graphs. In Shunsuke Inenaga and Simon J. Puglisi, editors, 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024), volume 296 of Leibniz International Proceedings in Informatics (LIPIcs), pages 6:1–6:16, Dagstuhl, Germany, 2024. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.CPM.2024.6.
  • [5] Giulia Bernardini, Huiping Chen, Grigorios Loukides, Solon P. Pissis, Leen Stougie, and Michelle Sweering. Making de Bruijn graphs Eulerian. In Hideo Bannai and Jan Holub, editors, 33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022, June 27-29, 2022, Prague, Czech Republic, volume 223 of LIPIcs, pages 12:1–12:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022. doi:10.4230/LIPICS.CPM.2022.12.
  • [6] Giulia Bernardini, Chang Liu, Grigorios Loukides, Alberto Marchetti-Spaccamela, Solon P. Pissis, Leen Stougie, and Michelle Sweering. Missing value replacement in strings and applications. Data Min. Knowl. Discov., 39(2):12, 2025. doi:10.1007/S10618-024-01074-3.
  • [7] Giulia Bernardini, Alberto Marchetti-Spaccamela, Solon P. Pissis, Leen Stougie, and Michelle Sweering. Constructing strings avoiding forbidden substrings. In Pawel Gawrychowski and Tatiana Starikovskaya, editors, 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021, July 5-7, 2021, Wrocław, Poland, volume 191 of LIPIcs, pages 9:1–9:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.CPM.2021.9.
  • [8] Otakar Borůvka. O jistém problému minimálním. Práce Moravské přírodovědecké společnosti, sv. III(spis 3):37–58, 1926. URL: https://dml.cz/handle/10338.dmlcz/500114.
  • [9] Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian Schmidt, Alexandru I. Tomescu, and Elia C. Zirondelli. The hydrostructure: a universal framework for safe and complete algorithms for genome assembly, 2021. arXiv:2011.12635.
  • [10] Alessio Conte, Roberto Grossi, Grigorios Loukides, Nadia Pisanti, Solon P. Pissis, and Giulia Punzi. Beyond the BEST theorem: Fast assessment of eulerian trails. In Evripidis Bampis and Aris Pagourtzis, editors, Fundamentals of Computation Theory - 23rd International Symposium, FCT 2021, Athens, Greece, September 12-15, 2021, Proceedings, volume 12867 of Lecture Notes in Computer Science, pages 162–175. Springer, 2021. doi:10.1007/978-3-030-86593-1_11.
  • [11] Martin Farach. Optimal suffix tree construction with large alphabets. In 38th Annual Symposium on Foundations of Computer Science, FOCS ’97, Miami Beach, Florida, USA, October 19-22, 1997, pages 137–143, 1997. doi:10.1109/SFCS.1997.646102.
  • [12] John Gallant, David Maier, and James A. Storer. On finding minimal length superstrings. J. Comput. Syst. Sci., 20(1):50–58, 1980. doi:10.1016/0022-0000(80)90004-5.
  • [13] Subhash Khot. On the power of unique 2-prover 1-round games. In Proceedings of the 17th Annual IEEE Conference on Computational Complexity, Montréal, Québec, Canada, May 21-24, 2002, page 25. IEEE Computer Society, 2002. doi:10.1109/CCC.2002.1004334.
  • [14] Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323–350, 1977. doi:10.1137/0206024.
  • [15] Lawrence T. Kou, George Markowsky, and Leonard Berman. A fast algorithm for Steiner trees. Acta Informatica, 15:141–145, 1981. doi:10.1007/BF00288961.
  • [16] Zhen Liu. Optimal routing in the de Bruijn networks. In 10th International Conference on Distributed Computing Systems (ICDCS 1990), May 28 - June 1, 1990, Paris, France, pages 537–544. IEEE Computer Society, 1990. doi:10.1109/ICDCS.1990.89261.
  • [17] Paul Medvedev, Konstantinos Georgiou, Gene Myers, and Michael Brudno. Computability of models for sequence assembly. In 7th WABI, volume 4645 of Lecture Notes in Computer Science, pages 289–301. Springer, 2007. doi:10.1007/978-3-540-74126-8_27.
  • [18] Paul Medvedev and Mihai Pop. What do Eulerian and Hamiltonian cycles have to do with genome assembly? PLOS Computational Biology, 17(5):1–5, May 2021. doi:10.1371/journal.pcbi.1008928.
  • [19] Jason R. Miller, Sergey Koren, and Granger Sutton. Assembly algorithms for next-generation sequencing data. Genomics, 95(6):315–327, 2010. doi:10.1016/j.ygeno.2010.03.001.
  • [20] Donald R. Morrison. PATRICIA - practical algorithm to retrieve information coded in alphanumeric. J. ACM, 15(4):514–534, 1968. doi:10.1145/321479.321481.
  • [21] Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci, 98(17):9748–9753, 2001. doi:10.1073/pnas.171285098.
  • [22] Robert C. Prim. Shortest connection networks and some generalizations. Bell System Technical Journal, 36:1389–1401, 1957. doi:10.1002/j.1538-7305.1957.tb01515.x.
  • [23] Sebastian Schmidt, Shahbaz Khan, Jarno N Alanko, Giulio E Pibiri, and Alexandru I Tomescu. Matchtigs: minimum plain text representation of k-mer sets. Genome Biology, 24(1):136, 2023. doi:10.1186/s13059-023-02968-z.
  • [24] Alexandru I. Tomescu and Paul Medvedev. Safe and complete contig assembly through omnitigs. J. Comput. Biol., 24(6):590–602, 2017. doi:10.1089/CMB.2016.0141.
  • [25] Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1–11. IEEE Computer Society, 1973. doi:10.1109/SWAT.1973.13.