Compression Algorithm for Colored de Bruijn Graphs

Authors Amatur Rahman , Yoann Dufresne, Paul Medvedev



PDF
Thumbnail PDF

File

LIPIcs.WABI.2023.17.pdf
  • Filesize: 0.77 MB
  • 14 pages

Document Identifiers

Author Details

Amatur Rahman
  • Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA
Yoann Dufresne
  • Institut Pasteur, Université Paris Cité, G5 Sequence Bioinformatics, Paris, France
  • Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, F-75015 Paris, France
Paul Medvedev
  • Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA
  • Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, USA
  • Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA

Acknowledgements

We would like to thank R. Chikhi for helpful discussions.

Cite As Get BibTex

Amatur Rahman, Yoann Dufresne, and Paul Medvedev. Compression Algorithm for Colored de Bruijn Graphs. In 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 17:1-17:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023) https://doi.org/10.4230/LIPIcs.WABI.2023.17

Abstract

A colored de Bruijn graph (also called a set of k-mer sets), is a set of k-mers with every k-mer assigned a set of colors. Colored de Bruijn graphs are used in a variety of applications, including variant calling, genome assembly, and database search. However, their size has posed a scalability challenge to algorithm developers and users. There have been numerous indexing data structures proposed that allow to store the graph compactly while supporting fast query operations. However, disk compression algorithms, which do not need to support queries on the compressed data and can thus be more space-efficient, have received little attention. The dearth of specialized compression tools has been a detriment to tool developers, tool users, and reproducibility efforts. In this paper, we develop a new tool that compresses colored de Bruijn graphs to disk, building on previous ideas for compression of k-mer sets and indexing colored de Bruijn graphs. We test our tool, called ESS-color, on various datasets, including both sequencing data and whole genomes. ESS-color achieves better compression than all evaluated tools and all datasets, with no other tool able to consistently achieve less than 44% space overhead.

Subject Classification

ACM Subject Classification
  • Applied computing → Computational biology
Keywords
  • colored de Bruijn graphs
  • disk compression
  • k-mer sets
  • simplitigs
  • spectrum-preserving string sets

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Jarno N. Alanko, Jaakko Vuohtoniemi, Tommi Mäklin, and Simon J. Puglisi. Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes. Bioinformatics, 2023. Google Scholar
  2. Fatemeh Almodaresi, Prashant Pandey, Michael Ferdman, Rob Johnson, and Rob Patro. An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search. Journal of Computational Biology, 27(4):485-499, 2020. Google Scholar
  3. Fatemeh Almodaresi, Prashant Pandey, and Rob Patro. Rainbowfish: A succinct colored de Bruijn graph representation. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017), volume 88 of Leibniz International Proceedings in Informatics (LIPIcs), pages 18:1-18:15, 2017. Google Scholar
  4. Phelim Bradley, Henk C Den Bakker, Eduardo PC Rocha, Gil McVean, and Zamin Iqbal. Ultrafast search of all deposited bacterial and viral genomic data. Nature biotechnology, 37(2):152-159, 2019. Google Scholar
  5. Karel Břinda, Michael Baym, and Gregory Kucherov. Simplitigs as an efficient and scalable representation of de Bruijn graphs. Genome biology, 22:1-24, 2021. Google Scholar
  6. Karel Břinda. Novel computational techniques for mapping and classifying Next-Generation Sequencing data. PhD thesis, Université Paris-Est, 2016. Google Scholar
  7. Andrea Cracco and Alexandru I. Tomescu. Extremely-fast construction and querying of compacted and colored de Bruijn graphs with GGCAT. BioRxiv, 2022. Google Scholar
  8. Zbigniew J Czech, George Havas, and Bohdan S Majewski. An optimal algorithm for generating minimal perfect hash functions. Information processing letters, 43(5):257-264, 1992. Google Scholar
  9. Daniel Danciu, Mikhail Karasikov, Harun Mustafa, André Kahles, and Gunnar Rätsch. Topology-based sparsification of graph annotations. Bioinformatics, 37(Supplement_1):i169-i176, 2021. Google Scholar
  10. Yoann Dufresne, Teo Lemane, Pierre Marijon, Pierre Peterlongo, Amatur Rahman, Marek Kokot, Paul Medvedev, Sebastian Deorowicz, and Rayan Chikhi. The K-mer File Format: a standardized and compact disk representation of sets of k-mers. Bioinformatics, 38(18):4423-4425, 2022. Google Scholar
  11. Robert S Harris and Paul Medvedev. Improved representation of Sequence Bloom Trees. Bioinformatics, 36(3):721-727, 2020. Google Scholar
  12. Guillaume Holley and Páll Melsted. Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs. Genome biology, 21(1):1-20, 2020. Google Scholar
  13. Zamin Iqbal, Mario Caccamo, Isaac Turner, Paul Flicek, and Gil McVean. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nature genetics, 44(2):226-232, 2012. Google Scholar
  14. Mikhail Karasikov, Harun Mustafa, Gunnar Rätsch, and André Kahles. Lossless indexing with counting de Bruijn graphs. In Research in Computational Molecular Biology, pages 374-376, 2022. Google Scholar
  15. Jamshed Khan, Marek Kokot, Sebastian Deorowicz, and Rob Patro. Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2. Genome biology, 23(1):190, 2022. Google Scholar
  16. Kazushi Kitaya and Tetsuo Shibuya. Compression of multiple k-mer sets by iterative SPSS decomposition. In 21st International Workshop on Algorithms in Bioinformatics (WABI 2021). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2021. Google Scholar
  17. Marek Kokot, Maciej Długosz, and Sebastian Deorowicz. KMC 3: counting and manipulating k-mer statistics. Bioinformatics, 33(17):2759-2761, 2017. Google Scholar
  18. Camille Marchet. Data-structures for sets of k-mer sets: what’s new since 2020. Blog post, 2022. URL: https://kamimrcht.github.io/webpage/sets_kmer_sets.html.
  19. Camille Marchet, Christina Boucher, Simon J. Puglisi, Paul Medvedev, Mikaël Salson, and Rayan Chikhi. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Research, 31(1):1-12, 2020. Google Scholar
  20. Camille Marchet, Zamin Iqbal, Daniel Gautheret, Mikaël Salson, and Rayan Chikhi. REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets. Bioinformatics, 36(Supplement_1):i177-i185, 2020. Google Scholar
  21. Joan Mas-Lloret, Mireia Obón-Santacana, Gemma Ibáñez-Sanz, Elisabet Guinó, Miguel L Pato, Francisco Rodriguez-Moranta, Alfredo Mata, Ana García-Rodríguez, Victor Moreno, and Ville Nikolai Pimenoff. Gut microbiome diversity detected by high-coverage 16S and shotgun sequencing of paired stool and colon sample. Scientific data, 7(1):92, 2020. Google Scholar
  22. Louis Papageorgiou, Picasi Eleni, Sofia Raftopoulou, Meropi Mantaiou, Vasileios Megalooikonomou, and Dimitrios Vlachakis. Genomic big data hitting the storage bottleneck. EMBnet. journal, 24, 2018. Google Scholar
  23. Rob Patro, Geet Duggal, Michael I Love, Rafael A Irizarry, and Carl Kingsford. Salmon provides fast and bias-aware quantification of transcript expression. Nature methods, 14(4):417-419, 2017. Google Scholar
  24. Giulio Ermanno Pibiri. Sparse and skew hashing of k-mers. Bioinformatics, 38(Supplement_1):i185-i194, 2022. Google Scholar
  25. Amatur Rahman, Rayan Chikhi, and Paul Medvedev. Disk compression of k-mer sets. Algorithms for Molecular Biology, 16(1):1-14, 2021. Google Scholar
  26. Amatur Rahman and Paul Medevedev. Representation of k-mer sets using spectrum-preserving string sets. Journal of Computational Biology, 28(4):381-394, 2021. Google Scholar
  27. Rajeev Raman, Venkatesh Raman, and S Srinivasa Rao. Succinct dynamic data structures. In Algorithms and Data Structures: 7th International Workshop, WADS 2001 Providence, RI, USA, August 8-10, 2001 Proceedings 7, pages 426-437. Springer, 2001. Google Scholar
  28. Brad Solomon and Carl Kingsford. Fast search of thousands of short-read sequencing experiments. Nature biotechnology, 34(3):300-302, 2016. Google Scholar
  29. Roland Wittler. Alignment-and reference-free phylogenomics with colored de Bruijn graphs. Algorithms for Molecular Biology, 15:1-12, 2020. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail