Eulertigs: Minimum Plain Text Representation of k-mer Sets Without Repetitions in Linear Time

Authors: Sebastian Schmidt, Jarno N. Alanko





Author Details

Sebastian Schmidt
  • University of Helsinki, Finland
Jarno N. Alanko
  • University of Helsinki, Finland

Cite As

Sebastian Schmidt and Jarno N. Alanko. Eulertigs: Minimum Plain Text Representation of k-mer Sets Without Repetitions in Linear Time. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 2:1-2:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.WABI.2022.2

Abstract

A fundamental operation in computational genomics is to reduce the input sequences to their constituent k-mers. For maximum performance of downstream applications, it is important to store the k-mers in small space while keeping the representation easy and efficient to use (i.e., without k-mer repetitions and in plain text). Recently, heuristics were presented to compute a near-minimum such representation. We present an algorithm to compute a minimum representation in optimal (linear) time and use it to evaluate the existing heuristics. To this end, we present a formalisation of arc-centric bidirected de Bruijn graphs and carefully prove that it accurately models the k-mer spectrum of the input. Our algorithm first constructs the de Bruijn graph in time linear in the length of the input strings (for a fixed-size alphabet). It then uses an Eulerian-cycle-based algorithm to compute the minimum representation, in time linear in the size of the output.
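
To make the notion of a plain text representation "without k-mer repetitions" concrete, the following minimal Python sketch extracts the k-mer spectrum of a set of strings and checks whether a candidate string set represents it without repetitions. It is an illustration only and not the algorithm of the paper: it assumes a DNA alphabet (A, C, G, T), assumes that a k-mer and its reverse complement count as the same k-mer (as the bidirected model suggests), and all function names are hypothetical.

```python
from collections import Counter

# Assumption: DNA alphabet only; a k-mer and its reverse complement are identified.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s):
    """Return the reverse complement of a DNA string."""
    return s.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def kmer_counts(strings, k):
    """Count how often each canonical k-mer occurs in a collection of strings."""
    counts = Counter()
    for s in strings:
        for i in range(len(s) - k + 1):
            counts[canonical(s[i:i + k])] += 1
    return counts


def is_repetition_free_representation(inputs, candidate, k):
    """True if `candidate` contains exactly the k-mers of `inputs`, each once."""
    target = set(kmer_counts(inputs, k))
    found = kmer_counts(candidate, k)
    return set(found) == target and all(c == 1 for c in found.values())


def total_length(strings):
    """Size of a plain text representation, measured in characters."""
    return sum(len(s) for s in strings)


inputs = ["ACCTG", "CCTGA"]
# One string covers all four k-mers (k = 3) exactly once.
ok = is_repetition_free_representation(inputs, ["ACCTGA"], 3)
# The inputs themselves repeat two k-mers, so they are rejected.
repeated = is_repetition_free_representation(inputs, inputs, 3)
```

In this toy instance the candidate `["ACCTGA"]` uses 6 characters, whereas the inputs use 10; a minimum representation is one that minimises this total length among all repetition-free candidates.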

Subject Classification

ACM Subject Classification
  • Applied computing → Computational biology
  • Theory of computation → Data compression
  • Theory of computation → Graph algorithms analysis
  • Theory of computation → Data structures design and analysis
Keywords
  • Spectrum preserving string sets
  • Eulerian cycle
  • Suffix tree
  • Bidirected arc-centric de Bruijn graph
  • k-mer based methods

