Compression Algorithm for Colored de Bruijn Graphs

Rahman, Amatur; Dufresne, Yoann; Medvedev, Paul

doi:10.4230/LIPIcs.WABI.2023.17

File

LIPIcs.WABI.2023.17.pdf

Filesize: 0.77 MB
14 pages

Document Identifiers

DOI: 10.4230/LIPIcs.WABI.2023.17
URN: urn:nbn:de:0030-drops-186434

Author Details

Amatur Rahman

Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA

Yoann Dufresne

Institut Pasteur, Université Paris Cité, G5 Sequence Bioinformatics, Paris, France
Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, F-75015 Paris, France

Paul Medvedev

Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA
Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, USA
Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA

Acknowledgements

We would like to thank R. Chikhi for helpful discussions.

Cite AsGet BibTex

Amatur Rahman, Yoann Dufresne, and Paul Medvedev. Compression Algorithm for Colored de Bruijn Graphs. In 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 17:1-17:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.WABI.2023.17

Abstract

A colored de Bruijn graph (also called a set of k-mer sets), is a set of k-mers with every k-mer assigned a set of colors. Colored de Bruijn graphs are used in a variety of applications, including variant calling, genome assembly, and database search. However, their size has posed a scalability challenge to algorithm developers and users. There have been numerous indexing data structures proposed that allow to store the graph compactly while supporting fast query operations. However, disk compression algorithms, which do not need to support queries on the compressed data and can thus be more space-efficient, have received little attention. The dearth of specialized compression tools has been a detriment to tool developers, tool users, and reproducibility efforts. In this paper, we develop a new tool that compresses colored de Bruijn graphs to disk, building on previous ideas for compression of k-mer sets and indexing colored de Bruijn graphs. We test our tool, called ESS-color, on various datasets, including both sequencing data and whole genomes. ESS-color achieves better compression than all evaluated tools and all datasets, with no other tool able to consistently achieve less than 44% space overhead.

Subject Classification

ACM Subject Classification

Applied computing → Computational biology

Keywords

colored de Bruijn graphs
disk compression
k-mer sets
simplitigs
spectrum-preserving string sets

Metrics

Access Statistics
Total Accesses (updated on a weekly basis)

0

PDF Downloads

0

Metadata Views

References

Jarno N. Alanko, Jaakko Vuohtoniemi, Tommi Mäklin, and Simon J. Puglisi. Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes. Bioinformatics, 2023.
Fatemeh Almodaresi, Prashant Pandey, Michael Ferdman, Rob Johnson, and Rob Patro. An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search. Journal of Computational Biology, 27(4):485-499, 2020.
Fatemeh Almodaresi, Prashant Pandey, and Rob Patro. Rainbowfish: A succinct colored de Bruijn graph representation. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017), volume 88 of Leibniz International Proceedings in Informatics (LIPIcs), pages 18:1-18:15, 2017.
Phelim Bradley, Henk C Den Bakker, Eduardo PC Rocha, Gil McVean, and Zamin Iqbal. Ultrafast search of all deposited bacterial and viral genomic data. Nature biotechnology, 37(2):152-159, 2019.
Karel Břinda, Michael Baym, and Gregory Kucherov. Simplitigs as an efficient and scalable representation of de Bruijn graphs. Genome biology, 22:1-24, 2021.
Karel Břinda. Novel computational techniques for mapping and classifying Next-Generation Sequencing data. PhD thesis, Université Paris-Est, 2016.
Andrea Cracco and Alexandru I. Tomescu. Extremely-fast construction and querying of compacted and colored de Bruijn graphs with GGCAT. BioRxiv, 2022.
Zbigniew J Czech, George Havas, and Bohdan S Majewski. An optimal algorithm for generating minimal perfect hash functions. Information processing letters, 43(5):257-264, 1992.
Daniel Danciu, Mikhail Karasikov, Harun Mustafa, André Kahles, and Gunnar Rätsch. Topology-based sparsification of graph annotations. Bioinformatics, 37(Supplement_1):i169-i176, 2021.
Yoann Dufresne, Teo Lemane, Pierre Marijon, Pierre Peterlongo, Amatur Rahman, Marek Kokot, Paul Medvedev, Sebastian Deorowicz, and Rayan Chikhi. The K-mer File Format: a standardized and compact disk representation of sets of k-mers. Bioinformatics, 38(18):4423-4425, 2022.
Robert S Harris and Paul Medvedev. Improved representation of Sequence Bloom Trees. Bioinformatics, 36(3):721-727, 2020.
Guillaume Holley and Páll Melsted. Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs. Genome biology, 21(1):1-20, 2020.
Zamin Iqbal, Mario Caccamo, Isaac Turner, Paul Flicek, and Gil McVean. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nature genetics, 44(2):226-232, 2012.
Mikhail Karasikov, Harun Mustafa, Gunnar Rätsch, and André Kahles. Lossless indexing with counting de Bruijn graphs. In Research in Computational Molecular Biology, pages 374-376, 2022.
Jamshed Khan, Marek Kokot, Sebastian Deorowicz, and Rob Patro. Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2. Genome biology, 23(1):190, 2022.
Kazushi Kitaya and Tetsuo Shibuya. Compression of multiple k-mer sets by iterative SPSS decomposition. In 21st International Workshop on Algorithms in Bioinformatics (WABI 2021). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2021.
Marek Kokot, Maciej Długosz, and Sebastian Deorowicz. KMC 3: counting and manipulating k-mer statistics. Bioinformatics, 33(17):2759-2761, 2017.
Camille Marchet. Data-structures for sets of k-mer sets: what’s new since 2020. Blog post, 2022. URL: https://kamimrcht.github.io/webpage/sets_kmer_sets.html.
Camille Marchet, Christina Boucher, Simon J. Puglisi, Paul Medvedev, Mikaël Salson, and Rayan Chikhi. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Research, 31(1):1-12, 2020.
Camille Marchet, Zamin Iqbal, Daniel Gautheret, Mikaël Salson, and Rayan Chikhi. REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets. Bioinformatics, 36(Supplement_1):i177-i185, 2020.
Joan Mas-Lloret, Mireia Obón-Santacana, Gemma Ibáñez-Sanz, Elisabet Guinó, Miguel L Pato, Francisco Rodriguez-Moranta, Alfredo Mata, Ana García-Rodríguez, Victor Moreno, and Ville Nikolai Pimenoff. Gut microbiome diversity detected by high-coverage 16S and shotgun sequencing of paired stool and colon sample. Scientific data, 7(1):92, 2020.
Louis Papageorgiou, Picasi Eleni, Sofia Raftopoulou, Meropi Mantaiou, Vasileios Megalooikonomou, and Dimitrios Vlachakis. Genomic big data hitting the storage bottleneck. EMBnet. journal, 24, 2018.
Rob Patro, Geet Duggal, Michael I Love, Rafael A Irizarry, and Carl Kingsford. Salmon provides fast and bias-aware quantification of transcript expression. Nature methods, 14(4):417-419, 2017.
Giulio Ermanno Pibiri. Sparse and skew hashing of k-mers. Bioinformatics, 38(Supplement_1):i185-i194, 2022.
Amatur Rahman, Rayan Chikhi, and Paul Medvedev. Disk compression of k-mer sets. Algorithms for Molecular Biology, 16(1):1-14, 2021.
Amatur Rahman and Paul Medevedev. Representation of k-mer sets using spectrum-preserving string sets. Journal of Computational Biology, 28(4):381-394, 2021.
Rajeev Raman, Venkatesh Raman, and S Srinivasa Rao. Succinct dynamic data structures. In Algorithms and Data Structures: 7th International Workshop, WADS 2001 Providence, RI, USA, August 8-10, 2001 Proceedings 7, pages 426-437. Springer, 2001.
Brad Solomon and Carl Kingsford. Fast search of thousands of short-read sequencing experiments. Nature biotechnology, 34(3):300-302, 2016.
Roland Wittler. Alignment-and reference-free phylogenomics with colored de Bruijn graphs. Algorithms for Molecular Biology, 15:1-12, 2020.