Shortest Undirected Paths in de Bruijn Graphs
Abstract
Computing shortest directed paths in de Bruijn graphs is well studied and well understood. This is not the case for computing undirected paths, which is much more challenging algorithmically. In this paper, we present a general framework for computing shortest undirected paths in arbitrary de Bruijn graphs, that is, arbitrary subgraphs of the complete de Bruijn graph. We then present an application of our techniques for making any arbitrary order- de Bruijn graph weakly connected by adding a set of edges of minimum total cost. This improves the running time of the recent -approximation algorithm by Bernardini et al. [CPM 2024] from to time, where is the number of weakly connected components of graph .
Keywords and phrases:
string algorithm, graph algorithm, de Bruijn graph, Eulerian graphFunding:
Wiktor Zuba: Supported in part by the Netherlands Organisation for Scientific Research (NWO) through Gravitation-grant NETWORKS that has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 101034253.Copyright and License:
![[Uncaptioned image]](x1.png)
2012 ACM Subject Classification:
Theory of computation Pattern matchingFunding:
A research visit during which part of the presented ideas were conceived was funded by a Royal Society International Exchanges Award.Editors:
Paola Bonizzoni and Veli MäkinenSeries and Publisher:

1 Introduction
We start with some basic definitions and notation from [5]. An alphabet is a finite set of elements called letters. We consider an integer alphabet . Let be a string of length over . By we denote the set of all strings of length . For two indices and of , is the fragment of starting at position and ending at position . The fragment is an occurrence of the underlying substring ; we say that occurs at position in . A prefix of is a substring of the form and a suffix of is a substring of the form . By or we denote the concatenation of strings and : .
Let us fix a collection of strings over alphabet . We define the order- de Bruijn graph (dBG, in short) of as a directed multigraph, denoted by , where is the set of length- substrings of the strings in and has an edge with multiplicity if and only if the strings and are equal and occurring exactly times in total in the strings of collection . In bioinformatics, models a collection of DNA sequences coming from a genome sample through a sequencing experiment, and any Eulerian trail of – a graph walk using each edge of exactly once – represents a potential reconstruction of the genome [21, 18]. Inspect Figure 1. This is an idealized model though, since would never be Eulerian in practice due to sequencing errors [19]; and, furthermore, would not even be weakly connected. We could make Eulerian by increasing the multiplicity of some of its existing edges [17] or introducing new ones [5]. In either case, the natural optimization goal is to minimize the total cost (number) of the added edges. In fact, many algorithms underlying genome assembly tackle similar problems [24, 1, 9, 23].
We also define the complete de Bruijn graph of order over the alphabet , denoted by , as a directed graph, where is the set of all the strings from and contains an edge if and only if string is obtained from string by appending letter after its last position and removing letter : has nodes and edges.
Our Motivation.
Bernardini et al. [4] studied the problem of making any arbitrary weakly connected by introducing a set of new edges of minimum total cost (as well as the underlying set of new nodes when those do not exist in ); by arbitrary , we mean an arbitrary subgraph of the complete dBG . Note that making any arbitrary graph weakly connected by introducing a set of new edges of minimum total cost is trivial: if the graph consists of components, the solution is to greedily add edges to connect the components. The algorithmic challenge with dBGs follows by their definition: not every pair of nodes can be connected via a single edge. Solving this problem is important because we can then use the linear-time algorithm of Bernardini et al. from [5] to balance the weakly connected graph by adding a set of edges of minimum total cost thus making Eulerian.111By Euler’s famous theorem, we know that a weakly connected directed graph is Eulerian if and only if every graph node has equal in-degree and out-degree (except perhaps for the source and target). Recall that making directly Eulerian by adding a set of new edges of minimum total cost is NP-hard from the well-known shortest common superstring problem [12]; hence, the connect-and-balance strategy is reasonable and generally serves as a good-performing heuristic [5]. Two remarks are in order: first, since the general task is to connect , we can safely assume that is either or (i.e., multiplicities play no role here); and second, since the task, in particular, is to connect with the smallest number of edge additions, we should rather seek shortest undirected paths in (i.e., shortest sequences of edges from , in which the edge directions are neglected). Inspect Figure 2.
Bernardini et al. [4] showed that making weakly connected by adding a set of edges of minimum total cost is NP-hard. They also showed that no polynomial-time approximation scheme (PTAS) exists for making weakly connected by adding a set of edges of minimum total cost (unless the unique games conjecture [13] fails). Finally, they also showed that there exists an -time -approximation algorithm for the same problem, where is the number of connected components of . In this paper, we introduce a general framework for finding shortest undirected paths in dBGs. In particular, using our framework, we improve the running time of the approximation algorithm of Bernardini et al. from to , while maintaining the same approximation ratio.
Our Framework.
Let us fix families of nodes (forming connected components in most applications) from , and let us denote .222We assume that these families are pairwise disjoint. However, our solutions work without this assumption. Throughout this paper, we treat nodes of dBGs and their length- string representations as equivalent: indeed, nodes of are in a natural bijection with . In particular, , for any , and are also treated as sets of strings.
We generally aim at obtaining efficient algorithms for finding the minimal distance between the nodes from two different families and ; more formally, for any ,
where is the length of a shortest undirected path from node to node in .
Note that if and form two weakly connected components of , then is precisely the minimum number of edges that must be added to connect the two components into a single one – assuming that we also add the nodes implied by those edges. In the same manner, by adding such undirected paths, we can (greedily) connect to a single component, thus making any arbitrary dBG weakly connected.
The problem of finding the shortest directed path between any two nodes of can be solved in the optimal time using the preprocessing of the classic KMP algorithm [14]. The same problem for undirected paths can also be solved in the optimal time [16]. By iteratively applying the latter result, we can compute in time and for all pairs in total time, which is very slow when is large, even if the number of families is relatively small. Using more refined techniques, based on the generalized suffix tree [25] of the strings from , we develop algorithms for finding the distances much more efficiently when the families are large or when there are many of them.
In particular, given the collection , we consider the following problems:
-
: output . Here we are given, in addition, and , and we are asked to find the length of a shortest undirected path between any node of and any node of .
-
: output , for every . Here we are given, in addition, , and we are asked to find the length of a shortest undirected path between any node of and any node of , for every .
-
All-to-All: output , for all . Here we are asked to find the length of a shortest undirected path between any node of and any node of , for all .
-
: Here we are given, in addition, , and we are asked to output, for every , distinct with the smallest value of , breaking ties arbitrarily.
Let us remark that the algorithms underlying our framework are constructive – whenever we compute , we also know a pair of nodes that are at distance . By applying the technique from [16], we can enhance the output of the above algorithms with optimal paths realising those distances at the additional linear cost in the size of the output – times the length of a shortest path (every node is explicitly encoded using letters). If the output is given in a compacted form – the differences between two consecutive nodes on the path (the new letter introduced and whether it is put in the front or back) – then the additional cost reduces to the length of the shortest path.
Our Results.
We make the following specific contributions:
-
an algorithm solving in time and space;
-
an algorithm solving in time and space;
-
an algorithm solving All-to-All in time and space;
-
an algorithm solving in time and space.
Application.
By plugging our results directly in the approximation algorithm of Bernardini et al. [4], we improve the running time of their algorithm for connecting dBGs from to . Using more refined techniques, we obtain an even further improvement; in particular, we design an -time algorithm for the same task, while maintaining the same approximation ratio of [4].
Paper Organization.
In Section 2, we present some preliminaries that essentially summarize the work in [16]. In Section 3, we present a simple linear-time algorithm for computing shortest paths in undirected dBGs. In Section 4, we present our framework: how distances between sets of nodes can be computed more efficiently in many settings. In Section 5, we apply our framework to the problem of making any arbitrary dBG weakly connected [4].
2 Preliminaries
Let denote the set , and let denote the substring of . Let represent nodes and (respectively) of .
Let be a common substring of and and assume that it occurs in those strings at positions and respectively, with . Notice that we can transform into by first removing the first letters of (appending arbitrary letters at its end at the same time), then adding the first letters of to its front (removing letters from its end – in particular all the letters added in the previous step), and then symmetrically remove the last letters and add the length- suffix of in their place. In total, this requires operations, and each such operation corresponds to an edge of the complete dBG. It turns out that one cannot get a path from to of shorter length other than by choosing a different common substring or different occurrences. This fact is summarized in Lemma 1 by Liu [16]. Inspect Figure 3.
Lemma 1 ([16]).
Let be two nodes of and be the string representations of , respectively. Then , where is the longest common prefix of and .
Suffix Tree.
The classic indexing solution for standard strings is the suffix tree [25]. Given a set of strings, the compacted trie of these strings is the trie obtained by compressing each path of nodes of degree one in the trie of the strings in . The dissolved nodes are called implicit while the preserved nodes are called explicit. For any node (implicit of explicit) of the compacted trie, denotes the string spelled from the root of the trie to ; the string depth of is then defined as . Each edge in the compacted trie has a label represented as a fragment of a string in . Thus, the compacted trie uses space [20]. The suffix tree is the compacted trie of the suffixes of string . Assuming ends with a unique terminating symbol , every suffix of is represented by a leaf annotated with index (the starting position of the suffix in ). The root node, the branching nodes, and the leaf nodes form the set of explicit nodes of ; inspect Figure 4. The suffix tree occupies space and it can be constructed in time [25, 11]. The edges of can be accessed in time if stored using perfect hashing [2]. The suffix tree then supports pattern matching queries for any pattern of length in time, where Occ is the set of output occurrences of in . The suffix tree can also be generalized to a collection of strings as the compacted trie of the suffixes of string , where , for , are distinct terminating symbols.
3 Simple Pairwise Distance Computation using Suffix Tree
Let and let . The sets and represent the occurrences of string in strings and , respectively. The distance as described in Lemma 1 can also be expressed as follows: 333If or , then the distance for this as witness is equal to , and hence the distance remains the same whether such are considered or not, thus the abstract formula can iterate over .
(1) |
Equation 1 changes the order of the maxima: the outer one is over any substring , and the inner one is over the occurrences (starting positions) of . We will next view Equation 1 through the lens of the generalized suffix tree of and .
Let be the generalized suffix tree of and . Since , has explicit nodes (and edges). For an explicit node of , let be its string depth (the length of the substring represented), , and for .
Each common substring is represented by an explicit (or implicit) node of . Notice, that if both and for some letter appear at the very same positions (both in and ), then the value of the formula for is larger than the one for by exactly , hence it is enough to focus on the explicit nodes of when we want to compute the optimal value. Hence Equation 1 can also be expressed as follows:
(2) |
Observation 2.
(respectively ) for a branching node , where is the set of direct descendants of in the suffix tree. For a leaf , the set from which (resp. ) is taken is either a singleton or an empty set.
A direct consequence of Observation 2 is that the values , and can be computed bottom up. We thus compute these values using a bottom-up traversal (and the depth via a top-down traversal) for each node in total time. This gives a simple -time algorithm for computing – after constructing the suffix tree and computing those values, it suffices to find the optimal value of Equation 2 over all the explicit nodes.
Lemma 3.
Let be two nodes of and be the string representations of , respectively. We can compute in time using .
We note that a different, yet much more complicated, -time algorithm for computing using suffix trees has already been given by Liu in [16].
Recall that . A naïve application of Lemma 3 for computing by explicitly computing the pairwise distance between all the pairs of nodes runs in time. In the next section, we present our framework: how distances between sets of nodes can be computed more efficiently in many settings.
4 Our Framework for Shortest Undirected Paths in de Bruijn Graphs
In this section, we provide simple and efficient solutions to the considered distance (shortest-path) problems. Our solutions rely only on the generalized suffix tree of the strings in question ( or ) and standard operations (graph traversals and dynamic programming) and hence are not only of theoretical interest, but should also admit efficient implementations.
4.1 One-to-One
Recall that in One-to-One, we are given , and , and we are asked to find the length of a shortest undirected path between any node of and any node of .
Note that in Equation 1, it does not really matter from which string in (resp. ) the occurrence of at position (resp. ) originates: by changing the order of the minima in the formula, we have that can be expressed by Equation 1 if we naturally extend the set , for . In particular, with the use of , for , based on the suffixes of all the strings representing , Equation 2 generalizes to the following one:
(3) |
Inspect Figure 5.
Since the size of is in , and it can be constructed in linear time, by repeating the same operations as for the computation of , we obtain:
Theorem 4.
We can solve One-to-One in time and space.
4.2 One-to-All
Recall that in , we are given and , and we are asked to find the length of a shortest undirected path between any node of and any node of , for every . Using Theorem 4, we directly obtain a solution to and to All-to-All running in and time, respectively, using space. We now proceed to a more refined processing of the suffix tree that allows us to obtain a more efficient algorithm for One-to-All, which is later used to solve . Our high-level goal is to reduce the number of nodes over which we must optimize Equation 3.
Let denote the set of all ancestors of in (including the node itself). Further let , and , for . Recall that . Recall also that as noted in Observation 2 for the leaf nodes of , denoted by , the sets are either equal to a singleton (when for is represented by node ) or empty. 444Note that for a leaf , ; that is, the label is not taken into account in computation of the length of the common substring represented – this plays a role only when computing anyway. In particular exists only for a single (as symbols are distinct). We can thus define Equation 4 as a more refined version of Equation 3:
(4) |
Example 5.
Consider the example from Figure 5: , , and . Consider the leaf nodes representing the strings and . Both leaf nodes have the following ancestors (other than themselves) and the following values:
-
ABCABC: ;
-
ABC: ;
-
empty string (root node): .
Let be the leaf node representing . We iterate over all ancestors of :
-
;
-
.
We have , which gives us the minimal distance, between and witnessed by ABCABC.
The following lemma is crucial for the correctness of our approach.
Lemma 6.
Equations 3 and 4 are equivalent.
Proof.
Let be the node of for which Equation 3 attains the optimal value, which w.l.o.g. is equal to , and let be the leaf descendants of for which and .
Since is an ancestor of , we know that , hence , thus the value of Equation 4 (witnessed by node ) is at least as large as the value of Equation 3 (witnessed by the node ).
For the converse inequality, w.l.o.g. the value of Equation 4 is equal to for a leaf node . By the definition of there exists an ancestor of such that . Hence (as ,), which shows that the value of Equation 3 cannot be smaller than the value of Equation 4.
Observation 7.
.555The equation for is analogous: .
A direct consequence of Observation 7 is that the values and can be computed top down. By computing these values using a top-down traversal in total time and space for all explicit nodes and computing Equation 4 for the leaves, we obtain:
Theorem 8.
We can solve One-to-All in time and space.
By directly computing the values and , for all , we obtain another algorithm to solve All-to-All in time. For this algorithm, the space used is . By simply using Theorem 8 times, the required space is reduced to . The computation of all the values in a single run over the generalized suffix tree has other nice properties, however – as shown in the next section it allows to restrict the output while also reducing the computation time and space.
4.3 Top
Recall that in , we are given and an integer , and we are asked to output, for every , distinct with the smallest value of , breaking ties arbitrarily.
We start with the following simple yet crucial observation: If for a fixed leaf of , the values of and over are ordered nonincreasingly, it suffices to know the first of those for each type (Min and Max), that is, we do not need to compute all the values.
Recall that the and values are computed by first computing the values and bottom up and then computing the values and top down for every node of . If we store the best (smallest for and largest for and ) values (over ) for each of those types, the values for the parent/children can be computed in time proportional to the number of nodes multiplied by . Indeed when computing the smallest (up to) values of , it suffices to find the smallest elements of the values stored in the children; hence that can be computed in time, where is the number of children of – this sums up to total time over all explicit nodes of . Here we use time to exclude the duplicate values – a check if this set is already represented can be done using an extra integer array of size (only one array for the whole computation) with -time updates.
After the computation of (resp. ) for all the nodes, the largest values (resp. ) can be obtained using Observation 7 from the values stored in the parent node, and the values (resp. ) – which are already sorted nonincreasingly due to the sorted order on (resp. ).
Finally, we once again simply iterate over the leaves of , and gather the smallest values of over all the leaves representing a suffix of a string from (using a bucket queue). We obtain the following result.
Theorem 9.
can be computed in time and space.
5 Application: Connecting de Bruijn Graphs Efficiently
While our framework may have other applications revolving around dBGs (e.g., [6, 7]), in this section, we showcase the application of making an arbitrary dBG weakly connected.
Let us fix an arbitrary dBG of order consisting of weakly connected components and also denote it by . Bernardini et al. [4] proved the following result.
Theorem 10 ([4]).
For any order- dBG consisting of weakly connected components, there exists an -time -approximation algorithm for making weakly connected by adding a set of edges of minimum total cost.
We improve Theorem 10 by slashing a factor of from the running time. Let be the graph obtained from the complete dBG by collapsing each component , , of into one super-node. The solution underlying Theorem 10 consists of the following three steps:
-
(i)
Construct the metric closure666Recall that the metric closure of a graph is the complete graph in which each edge is weighted by the shortest path distance between the nodes in . of .
-
(ii)
Compute a minimum spanning tree of the metric closure.
-
(iii)
Convert the minimum spanning tree into a set of nodes and a set of edges to be added to to make it weakly connected.
The correctness follows directly from the fact that a minimum spanning tree for the metric closure of is a -approximation for the minimum Steiner tree [15],777Recall that the minimum Steiner tree problem asks, given a graph with nonnegative edge weights and a subset of terminal nodes, to compute a tree of minimum weight that contains all terminals. where is the number of terminals and thus the number of weakly connected components of . Step (i) requires time by applying Lemma 3. Step (ii) can be done in time by applying, e.g., Prim’s algorithm [22]. Finally, Step (iii) can be done by applying again Lemma 3 to compute a shortest undirected path. This sums up to total time.
To complement Theorem 10, Bernardini et al. also showed that making weakly connected by adding a set of edges of minimum total cost is NP-hard and admits no PTAS.
Theorem 10 can be improved using our framework: Theorem 4 directly outputs the weights of the edges of the metric closure in time; this improves the running time from to time. In particular, with the use of the Top queries, we can obtain an even more efficient solution by removing the construction of the metric closure and instead computing its spanning tree directly from our input.
Let be an arbitrary undirected weighted graph. Further, let be a subset of defined by choosing, for each node , an edge incident with with the smallest weight (one of such edges in case of ties), and then removing, from each cycle obtained this way, the heaviest edge (breaking ties arbitrarily). We next state the following well-known lemma (cf. [8]), and also prove it here for completeness.
Lemma 11.
is a subset of a minimum spanning tree of .
Proof.
We show that starting from any arbitrary spanning tree of , we can modify it using a greedy approach so that the obtained spanning tree contains all edges of , and has a total weight at most as large as the weight of the initial spanning tree.
We iterate over the edges of ; we add the edge to the spanning tree and then remove one edge from the newly created cycle: the one with the largest weight and, among possibly many such edges, we prefer an edge that does not belong to .
Clearly, such an operation cannot increase the weight of the spanning tree – the worst-case scenario is that the newly added edge is immediately removed. If by applying this procedure, we never remove any edge from , then the claimed solution exists.
Towards the contradiction assume that after adding an edge , we create a cycle in which all the heaviest edges belong to . Take one such edge – it was chosen for a node . The other edge of the cycle incident with cannot be lighter (by the definition of ), hence by assumption it must have the same weight and hence must also belong to (by the assumption of this paragraph of the proof); we move on to the node for which this edge was chosen, and continue in the same way. Since the cycle is finite, at some point we have to come back to – but this means that the edges from formed a cycle – a contradiction with the definition of . Thus, is a subset of a minimum spanning tree of .
Recall that is the graph obtained from the complete dBG by collapsing each component , , of into one super-node. We show the following lemma (implementing the algorithm of [8]).
Lemma 12.
We can find a minimum spanning tree of in time using space.
Proof.
Let us remark that we do not explicitly construct or .
We start by producing a set of edges for using a query for . Such a query returns for each component of , that is, equivalently for each super-node of , two super-nodes closest to it. In particular, what is implied by this is that even if one of those two is itself (which happens in most cases), the other one must be different. By Theorem 9, in total time, we find, for each super-node of , one incident edge with the smallest weight possible – by taking this set of edges and removing from each cycle a single edge we obtain a valid set of edges .
By Lemma 11, we can safely report this set of edges as part of the output. We can also contract the super-nodes of connected with the edges from into other single super-nodes obtaining graph . Now every spanning tree of that contains is equivalent to the union of a spanning tree of and the set , hence the problem reduces to finding the minimum spanning tree of .
Notice that is represented by the very same input to our original problem on a dBG, just with some of the sets merged together, and so we can use the same generalized suffix tree just with different labels . We can thus repeat the same approach until we reach a graph with a single super-node. Note that each such iteration takes time (Theorem 9), and that there can be no more than such iterations because each time every super-node gets connected to another one, the number of super-nodes (components) drop by at least the factor – the statement follows.
Theorem 13.
For any order- dBG consisting of weakly connected components, there exists an -time -approximation algorithm for making weakly connected by adding a set of edges of minimum total cost.
Since the algorithm underlying Theorem 13 is near-optimal, the main open question is whether we can improve the approximation ratio.
References
- [1] Nidia Obscura Acosta, Veli Mäkinen, and Alexandru I. Tomescu. A safe and complete algorithm for metagenomic assembly. Algorithms Mol. Biol., 13(1):3:1–3:12, 2018. doi:10.1186/S13015-018-0122-7.
- [2] Michael A. Bender, Alex Conway, Martin Farach-Colton, William Kuszmaul, and Guido Tagliavini. Iceberg hashing: Optimizing many hash-table criteria at once. J. ACM, 70(6):40:1–40:51, 2023. doi:10.1145/3625817.
- [3] Giulia Bernardini, Huiping Chen, Gabriele Fici, Grigorios Loukides, and Solon P. Pissis. Reverse-safe text indexing. ACM J. Exp. Algorithmics, 26:1.10:1–1.10:26, 2021. doi:10.1145/3461698.
- [4] Giulia Bernardini, Huiping Chen, Inge Li Gørtz, Christoffer Krogh, Grigorios Loukides, Solon P. Pissis, Leen Stougie, and Michelle Sweering. Connecting de Bruijn Graphs. In Shunsuke Inenaga and Simon J. Puglisi, editors, 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024), volume 296 of Leibniz International Proceedings in Informatics (LIPIcs), pages 6:1–6:16, Dagstuhl, Germany, 2024. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.CPM.2024.6.
- [5] Giulia Bernardini, Huiping Chen, Grigorios Loukides, Solon P. Pissis, Leen Stougie, and Michelle Sweering. Making de Bruijn graphs Eulerian. In Hideo Bannai and Jan Holub, editors, 33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022, June 27-29, 2022, Prague, Czech Republic, volume 223 of LIPIcs, pages 12:1–12:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022. doi:10.4230/LIPICS.CPM.2022.12.
- [6] Giulia Bernardini, Chang Liu, Grigorios Loukides, Alberto Marchetti-Spaccamela, Solon P. Pissis, Leen Stougie, and Michelle Sweering. Missing value replacement in strings and applications. Data Min. Knowl. Discov., 39(2):12, 2025. doi:10.1007/S10618-024-01074-3.
- [7] Giulia Bernardini, Alberto Marchetti-Spaccamela, Solon P. Pissis, Leen Stougie, and Michelle Sweering. Constructing strings avoiding forbidden substrings. In Pawel Gawrychowski and Tatiana Starikovskaya, editors, 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021, July 5-7, 2021, Wrocław, Poland, volume 191 of LIPIcs, pages 9:1–9:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.CPM.2021.9.
- [8] Otakar Borůvka. O jistém problému minimálním. Práce Moravské přírodovědecké společnosti, sv. III(spis 3):37–58, 1926. URL: https://dml.cz/handle/10338.dmlcz/500114.
- [9] Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian Schmidt, Alexandru I. Tomescu, and Elia C. Zirondelli. The hydrostructure: a universal framework for safe and complete algorithms for genome assembly, 2021. arXiv:2011.12635.
- [10] Alessio Conte, Roberto Grossi, Grigorios Loukides, Nadia Pisanti, Solon P. Pissis, and Giulia Punzi. Beyond the BEST theorem: Fast assessment of eulerian trails. In Evripidis Bampis and Aris Pagourtzis, editors, Fundamentals of Computation Theory - 23rd International Symposium, FCT 2021, Athens, Greece, September 12-15, 2021, Proceedings, volume 12867 of Lecture Notes in Computer Science, pages 162–175. Springer, 2021. doi:10.1007/978-3-030-86593-1_11.
- [11] Martin Farach. Optimal suffix tree construction with large alphabets. In 38th Annual Symposium on Foundations of Computer Science, FOCS ’97, Miami Beach, Florida, USA, October 19-22, 1997, pages 137–143, 1997. doi:10.1109/SFCS.1997.646102.
- [12] John Gallant, David Maier, and James A. Storer. On finding minimal length superstrings. J. Comput. Syst. Sci., 20(1):50–58, 1980. doi:10.1016/0022-0000(80)90004-5.
- [13] Subhash Khot. On the power of unique 2-prover 1-round games. In Proceedings of the 17th Annual IEEE Conference on Computational Complexity, Montréal, Québec, Canada, May 21-24, 2002, page 25. IEEE Computer Society, 2002. doi:10.1109/CCC.2002.1004334.
- [14] Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323–350, 1977. doi:10.1137/0206024.
- [15] Lawrence T. Kou, George Markowsky, and Leonard Berman. A fast algorithm for Steiner trees. Acta Informatica, 15:141–145, 1981. doi:10.1007/BF00288961.
- [16] Zhen Liu. Optimal routing in the de Bruijn networks. In 10th International Conference on Distributed Computing Systems (ICDCS 1990), May 28 - June 1, 1990, Paris, France, pages 537–544. IEEE Computer Society, 1990. doi:10.1109/ICDCS.1990.89261.
- [17] Paul Medvedev, Konstantinos Georgiou, Gene Myers, and Michael Brudno. Computability of models for sequence assembly. In 7th WABI, volume 4645 of Lecture Notes in Computer Science, pages 289–301. Springer, 2007. doi:10.1007/978-3-540-74126-8_27.
- [18] Paul Medvedev and Mihai Pop. What do Eulerian and Hamiltonian cycles have to do with genome assembly? PLOS Computational Biology, 17(5):1–5, May 2021. doi:10.1371/journal.pcbi.1008928.
- [19] Jason R. Miller, Sergey Koren, and Granger Sutton. Assembly algorithms for next-generation sequencing data. Genomics, 95(6):315–327, 2010. doi:10.1016/j.ygeno.2010.03.001.
- [20] Donald R. Morrison. PATRICIA - practical algorithm to retrieve information coded in alphanumeric. J. ACM, 15(4):514–534, 1968. doi:10.1145/321479.321481.
- [21] Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci, 98(17):9748–9753, 2001. doi:10.1073/pnas.171285098.
- [22] Robert C. Prim. Shortest connection networks and some generalizations. Bell System Technical Journal, 36:1389–1401, 1957. doi:10.1002/j.1538-7305.1957.tb01515.x.
- [23] Sebastian Schmidt, Shahbaz Khan, Jarno N Alanko, Giulio E Pibiri, and Alexandru I Tomescu. Matchtigs: minimum plain text representation of k-mer sets. Genome Biology, 24(1):136, 2023. doi:10.1186/s13059-023-02968-z.
- [24] Alexandru I. Tomescu and Paul Medvedev. Safe and complete contig assembly through omnitigs. J. Comput. Biol., 24(6):590–602, 2017. doi:10.1089/CMB.2016.0141.
- [25] Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1–11. IEEE Computer Society, 1973. doi:10.1109/SWAT.1973.13.