Document

Making de Bruijn Graphs Eulerian

File

LIPIcs.CPM.2022.12.pdf
• Filesize: 1.01 MB
• 18 pages

Cite As

Giulia Bernardini, Huiping Chen, Grigorios Loukides, Solon P. Pissis, Leen Stougie, and Michelle Sweering. Making de Bruijn Graphs Eulerian. In 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 223, pp. 12:1-12:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.CPM.2022.12

Abstract

A directed multigraph is called Eulerian if it has a circuit which uses each edge exactly once. Euler’s theorem tells us that a weakly connected directed multigraph is Eulerian if and only if every node is balanced. Given a collection S of strings over an alphabet Σ, the de Bruijn graph (dBG) of order k of S is a directed multigraph G_{S,k}(V,E), where V is the set of length-(k-1) substrings of the strings in S, and G_{S,k} contains an edge (u,v) with multiplicity m_{u,v}, if and only if the string u[0]⋅ v is equal to the string u⋅ v[k-2] and this string occurs exactly m_{u,v} times in total in strings in S. Let G_{Σ,k}(V_{Σ,k},E_{Σ,k}) be the complete dBG of Σ^k. The Eulerian Extension (EE) problem on G_{S,k} asks to extend G_{S,k} with a set ℬ of nodes from V_{Σ,k} and a smallest multiset 𝒜 of edges from E_{Σ,k} to make it Eulerian. Note that extending dBGs is algorithmically much more challenging than extending general directed multigraphs because some edges in dBGs are by definition forbidden. Extending dBGs lies at the heart of sequence assembly [Medvedev et al., WABI 2007], one of the most important tasks in bioinformatics. The novelty of our work with respect to existing works is that we allow not only to duplicate existing edges of G_{S,k} but to also add novel edges and nodes, in an effort to (i) connect multiple components and (ii) reduce the total EE cost. It is easy to show that EE on G_{S,k} is NP-hard via a reduction from shortest common superstring. We further show that EE remains NP-hard, even when we are not allowed to add new nodes, via a highly non-trivial reduction from 3-SAT. We thus investigate the following two problems underlying EE in dBGs: 1) When G_{S,k} is not weakly connected, we are asked to connect its d > 1 components using a minimum-weight spanning tree, whose edges are paths on the underlying G_{Σ,k} and weights are the corresponding path lengths. This way of connecting guarantees that no new unbalanced node is added. We show that this problem can be solved in 𝒪(|V|klog d+|E|) time, which is nearly optimal, since the size of G_{S,k} is Θ(|V|k+|E|). 2) When G_{S,k} is not balanced, we are asked to extend G_{S,k} to H_{S,k}(V∪ℬ,E∪𝒜) such that every node of H_{S,k} is balanced and the total number |𝒜| of added edges is minimized. We show that this problem can be solved in the optimal 𝒪(k|V| + |E|+ |𝒜|) time. Let us stress that, although our main contributions are theoretical, the algorithms we design for the above two problems are practical. We combine the two algorithms in one method that makes any dBG Eulerian; and show experimentally that the cost of the obtained feasible solutions on real-world dBGs is substantially smaller than the corresponding cost obtained by existing greedy approaches.

Subject Classification

ACM Subject Classification
• Theory of computation → Pattern matching
Keywords
• string algorithms
• graph algorithms
• Eulerian graph
• de Bruijn graph

Metrics

• Access Statistics
• Total Accesses (updated on a weekly basis)
0

References

1. Alfred V. Aho and Margaret J. Corasick. Efficient string matching: An aid to bibliographic search. Commun. ACM, 18(6):333-340, 1975. URL: https://doi.org/10.1145/360825.360855.
2. Jarno Alanko and Tuukka Norri. Greedy shortest common superstring approximation in compact space. In 24th SPIRE, volume 10508 of Lecture Notes in Computer Science, pages 1-13. Springer, 2017. URL: https://doi.org/10.1007/978-3-319-67428-5_1.
3. Bastien Cazaux and Eric Rivals. A linear time algorithm for shortest cyclic cover of strings. J. Discrete Algorithms, 37:56-67, 2016. URL: https://doi.org/10.1016/j.jda.2016.05.001.
4. Bastien Cazaux and Eric Rivals. Superstrings with multiplicities. In 29th CPM, volume 105 of LIPIcs, pages 21:1-21:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018. URL: https://doi.org/10.4230/LIPIcs.CPM.2018.21.
5. Maxime Crochemore, Marek Cygan, Costas S. Iliopoulos, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, and Tomasz Walen. Algorithms for three versions of the shortest common superstring problem. In 21st CPM, pages 299-309, 2010. URL: https://doi.org/10.1007/978-3-642-13509-5_27.
6. Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms on strings. Cambridge University Press, 2007.
7. Martin Dietzfelbinger and Friedhelm Meyer auf der Heide. A new universal class of hash functions and dynamic hashing in real time. In 17th ICALP, pages 6-19, 1990. URL: https://doi.org/10.1007/BFb0032018.
8. Shiri Dori and Gad M. Landau. Construction of Aho Corasick automaton in linear time for integer alphabets. Inf. Process. Lett., 98(2):66-72, 2006. URL: https://doi.org/10.1016/j.ipl.2005.11.019.
9. Frederic Dorn, Hannes Moser, Rolf Niedermeier, and Mathias Weller. Efficient algorithms for Eulerian extension and Rural Postman. SIAM J. Discret. Math., 27(1):75-94, 2013. URL: https://doi.org/10.1137/110834810.
10. Ran Duan and Seth Pettie. Linear-time approximation for maximum weight matching. J. ACM, 61(1):1:1-1:23, 2014. URL: https://doi.org/10.1145/2529989.
11. Sara El-Metwally, Taher Hamza, Magdi Zakaria, and Mohamed Helmy. Next-generation sequence assembly: Four stages of data processing and computational challenges. PLoS Comput. Biol., 9(12), 2013. URL: https://doi.org/10.1371/journal.pcbi.1003345.
12. John Gallant, David Maier, and James A. Storer. On finding minimal length superstrings. J. Comput. Syst. Sci., 20(1):50-58, 1980. URL: https://doi.org/10.1016/0022-0000(80)90004-5.
13. Michael R. Garey and David S. Johnson. Computers and Intractability; A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., USA, 1990.
14. John E. Hopcroft and Robert Endre Tarjan. Efficient algorithms for graph manipulation [H] (algorithm 447). Commun. ACM, 16(6):372-378, 1973. URL: https://doi.org/10.1145/362248.362272.
15. Richard M. Karp. Reducibility among combinatorial problems. In Proceedings of a symposium on the Complexity of Computer Computation, The IBM Research Symposia Series, pages 85-103. Plenum Press, New York, 1972. URL: https://doi.org/10.1007/978-1-4684-2001-2_9.
16. Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323-350, 1977. URL: https://doi.org/10.1137/0206024.
17. Joseph B Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7(1):48-50, 1956. URL: https://doi.org/10.1090/S0002-9939-1956-0078686-7.
18. Paul Medvedev. Modeling biological problems in computer science: a case study in genome assembly. Briefings Bioinform., 20(4):1376-1383, 2019. URL: https://doi.org/10.1093/bib/bby003.
19. Paul Medvedev, Konstantinos Georgiou, Gene Myers, and Michael Brudno. Computability of models for sequence assembly. In 7th WABI, volume 4645 of Lecture Notes in Computer Science, pages 289-301. Springer, 2007. URL: https://doi.org/10.1007/978-3-540-74126-8_27.
20. Paul Medvedev and Mihai Pop. What do Eulerian and Hamiltonian cycles have to do with genome assembly? PLOS Computational Biology, 17(5):1-5, May 2021. URL: https://doi.org/10.1371/journal.pcbi.1008928.
21. Jason R. Miller, Sergey Koren, and Granger Sutton. Assembly algorithms for next-generation sequencing data. Genomics, 95(6):315-327, 2010. URL: https://doi.org/10.1016/j.ygeno.2010.03.001.
22. Niranjan Nagarajan and Mihai Pop. Sequence assembly demystified. Nature Reviews Genetics, 14:157-167, 2013.
23. Clifford S. Orloff. A fundamental problem in vehicle routing. Networks, 4(1):35-64, 1974. URL: https://doi.org/10.1002/net.3230040105.
24. Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci, 98(17):9748-9753, 2001. URL: https://doi.org/10.1073/pnas.171285098.
25. Michael C. Schatz, Arthur L. Delcher, and Steven L. Salzberg. Assembly of large genomes using second-generation sequencing. Genome Res., 20(9):1165-1173, 2010. URL: https://doi.org/10.1101/gr.101360.109.
26. Jared T. Simpson and Mihai Pop. The theory and practice of genome sequence assembly. Annu Rev Genomics Hum Genet, 16:153-172, 2015.
27. Jang-il Sohn and Jin-Wu Nam. The present and future of de novo whole-genome assembly. Briefings Bioinform., 19(1):23-40, 2018. URL: https://doi.org/10.1093/bib/bbw096.
28. Esko Ukkonen. A linear-time algorithm for finding approximate shortest common superstrings. Algorithmica, 5(3):313-323, 1990. URL: https://doi.org/10.1007/BF01840391.
29. Bilal Wajid and Erchin Serpedin. Review of general algorithmic features for genome assemblers for next generation sequencers. Genomics, Proteomics & Bioinformatics, 10(2):58-73, 2012. URL: https://doi.org/doi.org/10.1016/j.gpb.2012.05.006.
X

Feedback for Dagstuhl Publishing