Disk Compression of k-mer Sets

Rahman, Amatur; Chikhi, Rayan; Medvedev, Paul

doi:10.4230/LIPIcs.WABI.2020.16

Abstract

K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. General purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm UST-Compress that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from the representation of UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent, across a breadth of datasets. We also derive lower bounds on how well this type of compression strategy can hope to do.

URL: https://github.com/cosmo-team/cosmo/tree/VARI.
URL: https://github.com/prophyle/prophasm.
Jørgen Bang-Jensen and Gregory Z Gutin. Digraphs: theory, algorithms and applications. Springer Science & Business Media, 2008.
Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A Gurevich, Mikhail Dvorkin, Alexander S Kulikov, Valery M Lesin, Sergey I Nikolenko, Son Pham, Andrey D Prjibelski, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational biology, 19(5):455-477, 2012.
Timo Bingmann, Phelim Bradley, Florian Gauger, and Zamin Iqbal. COBS: a compact bit-sliced signature index. arXiv preprint arXiv:1905.09624, 2019.
Alexander Bowe, Taku Onodera, Kunihiko Sadakane, and Tetsuo Shibuya. Succinct de bruijn graphs. In Proceedings of the 12th International Conference on Algorithms in Bioinformatics, volume 7534 of LNCS, page 225–235. Springer, 2012.
Phelim Bradley, Henk C den Bakker, Eduardo PC Rocha, Gil McVean, and Zamin Iqbal. Ultrafast search of all deposited bacterial and viral genomic data. Nature biotechnology, 37(2):152, 2019.
Karel Břinda, Michael Baym, and Gregory Kucherov. Simplitigs as an efficient and scalable representation of de Bruijn graphs. bioRxiv, 2020. URL: https://doi.org/10.1101/2020.01.12.903443.
Karel Břinda. Novel computational techniques for mapping and classifying Next-Generation Sequencing data. PhD thesis, Université Paris-Est, November 2016. URL: https://doi.org/10.5281/zenodo.1045317.
Rayan Chikhi, Jan Holub, and Paul Medvedev. Data structures to represent sets of k-long DNA sequences. arXiv:1903.12312 [cs, q-bio], 2019.
Rayan Chikhi, Antoine Limasset, Shaun Jackman, Jared T Simpson, and Paul Medvedev. On the representation of de Bruijn graphs. In Research in Computational Molecular Biology, RECOMB 2014, volume 8394 of Lecture Notes in Computer Science, pages 35-55. Springer, 2014.
Rayan Chikhi, Antoine Limasset, and Paul Medvedev. Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics, 32(12):i201-i208, 2016.
Temesgen Hailemariam Dadi, Enrico Siragusa, Vitor C Piro, Andreas Andrusch, Enrico Seiler, Bernhard Y Renard, and Knut Reinert. DREAM-Yara: An exact read mapper for very large databases with short update time. Bioinformatics, 34(17):i766-i772, 2018.
Luca Denti, Marco Previtali, Giulia Bernardini, Alexander Schönhuth, and Paola Bonizzoni. MALVA: genotyping by Mapping-free ALlele detection of known VAriants. iScience, 18:20-27, 2019.
R. S. Harris and P. Medvedev. Improved Representation of Sequence Bloom Trees. Bioinformatics, 36(3):721-727, 2020.
Mikel Hernaez, Dmitri Pavlichin, Tsachy Weissman, and Idoia Ochoa. Genomic Data Compression. Annual Review of Biomedical Data Science, 2, 2019.
Morteza Hosseini, Diogo Pratas, and Armando Pinho. A survey on data compression methods for biological sequences. Information, 7(4):56, 2016.
Costas S Iliopoulos, Ritu Kundu, and Solon P Pissis. Efficient pattern matching in elastic-degenerate texts. In International Conference on Language and Automata Theory and Applications, pages 131-142. Springer, 2017.
Marek Kokot, Maciej Długosz, and Sebastian Deorowicz. KMC 3: counting and manipulating k-mer statistics. Bioinformatics, 33(17):2759-2761, 2017.
Kirill Kryukov, Mahoko Takahashi Ueda, So Nakagawa, and Tadashi Imanishi. Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences. bioRxiv, page 501130, 2018.
Guillaume Marçais and Carl Kingsford. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27(6):764-770, 2011.
Camille Marchet, Christina Boucher, Simon J Puglisi, Paul Medvedev, Mikaël Salson, and Rayan Chikhi. Data structures based on k-mers for querying large collections of sequencing datasets. bioRxiv, page 866756, 2019.
Camille Marchet, Zamin Iqbal, Daniel Gautheret, Mikaël Salson, and Rayan Chikhi. Reindeer: efficient indexing of k-mer presence and abundance in sequencing datasets. bioRxiv, 2020.
Ibrahim Numanagić, James K Bonfield, Faraz Hach, Jan Voges, Jörn Ostermann, Claudio Alberti, Marco Mattavelli, and S Cenk Sahinalp. Comparison of high-throughput sequencing data compression tools. Nature methods, 13(12):1005, 2016.
Brian D Ondov, Todd J Treangen, Páll Melsted, Adam B Mallonee, Nicholas H Bergman, Sergey Koren, and Adam M Phillippy. Mash: fast genome and metagenome distance estimation using MinHash. Genome biology, 17(1):132, 2016.
Prashant Pandey, Fatemeh Almodaresi, Michael A Bender, Michael Ferdman, Rob Johnson, and Rob Patro. Mantis: A fast, small, and exact large-scale sequence-search index. Cell systems, 7(2):201-207, 2018.
Prashant Pandey, Michael A Bender, Rob Johnson, and Rob Patro. Squeakr: an exact and approximate k-mer counting system. Bioinformatics, 34(4):568-575, 2017.
Armando J Pinho and Diogo Pratas. MFCompress: a compression tool for FASTA and multi-FASTA data. Bioinformatics, 30(1):117-118, 2013.
Amatur Rahman and Paul Medvedev. Representation of k-mer sets using spectrum-preserving string sets. In Research in Computational Molecular Biology - 24th Annual International Conference, RECOMB 2020, Padua, Italy, May 10-13, 2020, Proceedings, volume 12074 of Lecture Notes in Computer Science, pages 152-168. Springer, 2020.
Guillaume Rizk, Dominique Lavenier, and Rayan Chikhi. DSK: k-mer counting with very low memory usage. Bioinformatics, 29(5):652-653, 2013.
Steven L Salzberg, Adam M Phillippy, Aleksey Zimin, Daniela Puiu, Tanja Magoc, Sergey Koren, Todd J Treangen, Michael C Schatz, Arthur L Delcher, Michael Roberts, et al. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome research, 22(3):557-567, 2012.
B. Solomon and C. Kingsford. Fast search of thousands of short-read sequencing experiments. Nature biotechnology, 34(3):300-302, 2016.
B. Solomon and C. Kingsford. Improved Search of Large Transcriptomic Sequencing Databases Using Split Sequence Bloom Trees. Journal of Computational Biology, 25(7):755-765, 2018.
Daniel S Standage, C Titus Brown, and Fereydoun Hormozdiari. Kevlar: a mapping-free framework for accurate discovery of de novo variants. iScience, 18:28-36, 2019.
Chen Sun, Robert S. Harris, Rayan Chikhi, and Paul Medvedev. AllSome Sequence Bloom Trees. In Research in Computational Molecular Biology - 21st Annual International Conference, RECOMB 2017, Hong Kong, China, May 3-7, 2017, Proceedings, volume 10229 of Lecture Notes in Computer Science, pages 272-286, 2017.
Chen Sun and Paul Medvedev. Toward fast and accurate snp genotyping from whole genome sequencing data for bedside diagnostics. Bioinformatics, 35(3):415-420, 2018.
Isaac Turner, Kiran V Garimella, Zamin Iqbal, and Gil McVean. Integrating long-range connectivity information into de bruijn graphs. Bioinformatics, 34(15):2556-2565, 2018.
Derrick E Wood and Steven L Salzberg. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome biology, 15(3):R46, 2014.

Disk Compression of k-mer Sets

Authors Amatur Rahman, Rayan Chikhi, Paul Medvedev

File

Document Identifiers

Author Details

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

References

Disk Compression of k-mer Sets

Authors Amatur Rahman, Rayan Chikhi, Paul Medvedev

File

Document Identifiers

Author Details

Funding

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

Supplementary Materials

References