Connecting de Bruijn Graphs

Authors: Giulia Bernardini, Huiping Chen, Inge Li Gørtz, Christoffer Krogh, Grigorios Loukides, Solon P. Pissis, Leen Stougie, Michelle Sweering




File

LIPIcs.CPM.2024.6.pdf
  • Filesize: 1.31 MB
  • 16 pages

Author Details

Giulia Bernardini
  • University of Trieste, Trieste, Italy
Huiping Chen
  • University of Birmingham, Birmingham, UK
Inge Li Gørtz
  • Technical University of Denmark, Lyngby, Denmark
Christoffer Krogh
  • Technical University of Denmark, Lyngby, Denmark
Grigorios Loukides
  • King’s College London, London, UK
Solon P. Pissis
  • CWI, Amsterdam, The Netherlands
  • Vrije Universiteit, Amsterdam, The Netherlands
Leen Stougie
  • CWI, Amsterdam, The Netherlands
  • Vrije Universiteit, Amsterdam, The Netherlands
Michelle Sweering
  • CWI, Amsterdam, The Netherlands

Cite As

Giulia Bernardini, Huiping Chen, Inge Li Gørtz, Christoffer Krogh, Grigorios Loukides, Solon P. Pissis, Leen Stougie, and Michelle Sweering. Connecting de Bruijn Graphs. In 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 296, pp. 6:1-6:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.CPM.2024.6

Abstract

We study the problem of making a de Bruijn graph (dBG), constructed from a collection of strings, weakly connected while minimizing the total cost of edge additions. The input graph is a dBG that can be made weakly connected by adding edges (along with extra nodes if needed) from the underlying complete dBG. The problem arises from genome reconstruction, where the dBG is constructed from a set of sequences generated from a genome sample by a sequencing experiment. Due to sequencing errors, the dBG is never Eulerian in practice and is often not even weakly connected. We show the following results for a dBG G(V,E) of order k consisting of d weakly connected components:

1) Making G weakly connected by adding a set of edges of minimal total cost is NP-hard.
2) No PTAS exists for making G weakly connected by adding a set of edges of minimal total cost (unless the unique games conjecture fails). We complement this result by showing that there does exist a polynomial-time (2-2/d)-approximation algorithm for the problem.
3) We consider a restricted version of the above problem, where we are asked to make G weakly connected by only adding directed paths between pairs of components. We show that making G weakly connected by adding d-1 such paths of minimal total cost can be done in 𝒪(k|V|α(|V|)+|E|) time, where α(⋅) is the inverse Ackermann function. This improves on the 𝒪(k|V|log(|V|)+|E|)-time algorithm proposed by Bernardini et al. [CPM 2022] for the same restricted problem.
4) We give an ILP formulation of polynomial size for making G Eulerian with minimal total cost.
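To make the setting concrete, the following minimal Python sketch builds a dBG from a collection of strings and counts its weakly connected components (the quantity d in the abstract) with a union-find structure. It is an illustration only, not the algorithm of the paper. It assumes one common convention, which the abstract does not spell out: nodes are the length-(k-1) substrings, and every length-k substring contributes an edge from its prefix to its suffix. The function names (`build_dbg`, `weak_components`, `overlap`, `shortest_path_len`) are hypothetical. The last helper assumes that the cost of a connecting path is the number of edges it adds, and uses the standard fact that the shortest path from u to v in the complete dBG has length (k-1) minus the longest suffix-prefix overlap of u and v.

```python
from collections import defaultdict


def build_dbg(strings, k):
    """Build an order-k dBG: nodes are (k-1)-mers, one edge per k-mer occurrence.

    Assumed convention (not stated in the abstract): the k-mer s[i:i+k]
    yields the edge s[i:i+k-1] -> s[i+1:i+k].
    """
    nodes = set()
    edges = []
    for s in strings:
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            u, v = kmer[:-1], kmer[1:]
            nodes.update((u, v))
            edges.append((u, v))
    return nodes, edges


def weak_components(nodes, edges):
    """Group nodes into weakly connected components (edge direction ignored)."""
    parent = {v: v for v in nodes}

    def find(x):
        # Path halving keeps the union-find trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv

    groups = defaultdict(set)
    for v in nodes:
        groups[find(v)].add(v)
    return list(groups.values())


def overlap(u, v):
    """Length of the longest suffix of u that is also a prefix of v."""
    for length in range(min(len(u), len(v)), 0, -1):
        if u[-length:] == v[:length]:
            return length
    return 0


def shortest_path_len(u, v):
    """Number of edges on a shortest u -> v path in the complete dBG."""
    return len(u) - overlap(u, v)


nodes, edges = build_dbg(["ACGT", "TTGA"], k=3)
components = weak_components(nodes, edges)
print(len(components))               # number of weakly connected components
print(shortest_path_len("GT", "TT"))  # edges needed to link GT to TT
```

In this toy input the two strings share no (k-1)-mer, so the graph splits into two components; a single added edge GT → TT (the k-mer GTT) is enough to connect them under the assumed cost model.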

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • string algorithm
  • graph algorithm
  • de Bruijn graph
  • Eulerian graph


References

  1. Alfred V. Aho and Margaret J. Corasick. Efficient string matching: An aid to bibliographic search. Commun. ACM, 18(6):333-340, 1975. URL: https://doi.org/10.1145/360825.360855.
  2. Per Austrin, Subhash Khot, and Muli Safra. Inapproximability of vertex cover and independent set in bounded degree graphs. In 2009 24th Annual IEEE Conference on Computational Complexity, pages 74-80, 2009. URL: https://doi.org/10.1109/CCC.2009.38.
  3. Giulia Bernardini, Huiping Chen, Gabriele Fici, Grigorios Loukides, and Solon P. Pissis. Reverse-safe data structures for text indexing. In Symposium on Algorithm Engineering and Experiments (ALENEX), pages 199-213. SIAM, 2020. URL: https://doi.org/10.1137/1.9781611976007.16.
  4. Giulia Bernardini, Huiping Chen, Gabriele Fici, Grigorios Loukides, and Solon P. Pissis. Reverse-safe text indexing. ACM J. Exp. Algorithmics, 26:1.10:1-1.10:26, 2021. URL: https://doi.org/10.1145/3461698.
  5. Giulia Bernardini, Huiping Chen, Grigorios Loukides, Solon P. Pissis, Leen Stougie, and Michelle Sweering. Making de Bruijn graphs Eulerian. In 33rd Annual Symposium on Combinatorial Pattern Matching (CPM), volume 223 of LIPIcs, pages 12:1-12:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022. URL: https://doi.org/10.4230/LIPIcs.CPM.2022.12.
  6. Giulia Bernardini, Alessio Conte, Estéban Gabory, Roberto Grossi, Grigorios Loukides, Solon P. Pissis, Giulia Punzi, and Michelle Sweering. On strings having the same length-k substrings. In 33rd Annual Symposium on Combinatorial Pattern Matching (CPM), volume 223 of LIPIcs, pages 16:1-16:17. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022. URL: https://doi.org/10.4230/LIPIcs.CPM.2022.16.
  7. Giulia Bernardini, Alberto Marchetti-Spaccamela, Solon P. Pissis, Leen Stougie, and Michelle Sweering. Constructing strings avoiding forbidden substrings. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM), volume 191 of LIPIcs, pages 9:1-9:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. URL: https://doi.org/10.4230/LIPICS.CPM.2021.9.
  8. Karel Břinda, Michael Baym, and Gregory Kucherov. Simplitigs as an efficient and scalable representation of de Bruijn graphs. Genome biology, 22:1-24, 2021. URL: https://doi.org/10.1186/s13059-021-02297-z.
  9. Shiri Dori and Gad M. Landau. Construction of Aho Corasick automaton in linear time for integer alphabets. Inf. Process. Lett., 98(2):66-72, 2006. URL: https://doi.org/10.1016/j.ipl.2005.11.019.
  10. Zvi Galil and Giuseppe F. Italiano. Data structures and algorithms for disjoint set union problems. ACM Computing Surveys (CSUR), 23(3):319-344, 1991. URL: https://doi.org/10.1145/116873.116878.
  11. John Gallant, David Maier, and James A. Storer. On finding minimal length superstrings. J. Comput. Syst. Sci., 20(1):50-58, 1980. URL: https://doi.org/10.1016/0022-0000(80)90004-5.
  12. Dan Gusfield, Gad M. Landau, and Baruch Schieber. An efficient algorithm for the all pairs suffix-prefix problem. Inf. Process. Lett., 41(4):181-185, 1992. URL: https://doi.org/10.1016/0020-0190(92)90176-V.
  13. John E. Hopcroft and Robert Endre Tarjan. Efficient algorithms for graph manipulation [H] (algorithm 447). Commun. ACM, 16(6):372-378, 1973. URL: https://doi.org/10.1145/362248.362272.
  14. George Karakostas. A better approximation ratio for the vertex cover problem. ACM Trans. Algorithms, 5(4):41:1-41:8, 2009. URL: https://doi.org/10.1145/1597036.1597045.
  15. Richard M. Karp. Reducibility among combinatorial problems. In Proceedings of a symposium on the Complexity of Computer Computation, The IBM Research Symposia Series, pages 85-103. Plenum Press, New York, 1972. URL: https://doi.org/10.1007/978-1-4684-2001-2_9.
  16. Subhash Khot. On the power of unique 2-prover 1-round games. In Proceedings on 34th Annual ACM Symposium on Theory of Computing (STOC), pages 767-775. ACM, 2002. URL: https://doi.org/10.1145/509907.510017.
  17. Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323-350, 1977. URL: https://doi.org/10.1137/0206024.
  18. Lawrence T. Kou, George Markowsky, and Leonard Berman. A fast algorithm for Steiner trees. Acta Informatica, 15:141-145, 1981. URL: https://doi.org/10.1007/BF00288961.
  19. Joseph B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7(1):48-50, 1956. URL: https://doi.org/10.1090/S0002-9939-1956-0078686-7.
  20. Zhen Liu. Optimal routing in the De Bruijn networks. Research Report RR-1130, INRIA, 1990. URL: https://hal.inria.fr/inria-00075429.
  21. Grigorios Loukides and Solon P. Pissis. All-pairs suffix/prefix in optimal time using Aho-Corasick space. Inf. Process. Lett., 178:106275, 2022. URL: https://doi.org/10.1016/J.IPL.2022.106275.
  22. Paul Medvedev, Konstantinos Georgiou, Gene Myers, and Michael Brudno. Computability of models for sequence assembly. In 7th WABI, volume 4645 of Lecture Notes in Computer Science, pages 289-301. Springer, 2007. URL: https://doi.org/10.1007/978-3-540-74126-8_27.
  23. Paul Medvedev and Mihai Pop. What do Eulerian and Hamiltonian cycles have to do with genome assembly? PLOS Computational Biology, 17(5):1-5, May 2021. URL: https://doi.org/10.1371/journal.pcbi.1008928.
  24. Jason R. Miller, Sergey Koren, and Granger Sutton. Assembly algorithms for next-generation sequencing data. Genomics, 95(6):315-327, 2010. URL: https://doi.org/10.1016/j.ygeno.2010.03.001.
  25. Christos H. Papadimitriou and Mihalis Yannakakis. Optimization, approximation, and complexity classes. J. Comput. Syst. Sci., 43(3):425-440, 1991. URL: https://doi.org/10.1016/0022-0000(91)90023-X.
  26. Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci, 98(17):9748-9753, 2001. URL: https://doi.org/10.1073/pnas.171285098.
  27. Robert C. Prim. Shortest connection networks and some generalizations. Bell System Technical Journal, 36:1389-1401, 1957. URL: https://doi.org/10.1002/j.1538-7305.1957.tb01515.x.
  28. Amatur Rahman and Paul Medvedev. Representation of k-mer sets using spectrum-preserving string sets. J. Comput. Biol., 28(4):381-394, 2021. URL: https://doi.org/10.1089/CMB.2020.0431.
  29. Sebastian Schmidt, Shahbaz Khan, Jarno N. Alanko, Giulio E. Pibiri, and Alexandru I. Tomescu. Matchtigs: minimum plain text representation of k-mer sets. Genome Biology, 24(1):136, 2023. URL: https://doi.org/10.1186/s13059-023-02968-z.
  30. Sebastian S. Schmidt and Jarno N. Alanko. Eulertigs: minimum plain text representation of k-mer sets without repetitions in linear time. Algorithms Mol. Biol., 18(1):5, 2023. URL: https://doi.org/10.1186/S13015-023-00227-1.
  31. Ondřej Sladký, Pavel Veselý, and Karel Břinda. Masked superstrings as a unified framework for textual k-mer set representations. bioRxiv, pages 2023-02, 2023.
  32. William H.A. Tustumi, Simon Gog, Guilherme P. Telles, and Felipe A. Louza. An improved algorithm for the all-pairs suffix–prefix problem. Journal of Discrete Algorithms, 37:34-43, 2016. 2015 London Stringology Days and London Algorithmic Workshop (LSD & LAW). URL: https://doi.org/10.1016/j.jda.2016.04.002.
  33. Esko Ukkonen. A linear-time algorithm for finding approximate shortest common superstrings. Algorithmica, 5(3):313-323, 1990. URL: https://doi.org/10.1007/BF01840391.