Disk Compression of k-mer Sets

Authors Amatur Rahman, Rayan Chikhi, Paul Medvedev

Thumbnail PDF


  • Filesize: 1.35 MB
  • 18 pages

Document Identifiers

Author Details

Amatur Rahman
  • Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA, USA
Rayan Chikhi
  • Department of Computational Biology, C3BI USR 3756 CNRS, Institut Pasteur, Paris, France
Paul Medvedev
  • Pennsylvania State University, University Park, PA, USA

Cite AsGet BibTex

Amatur Rahman, Rayan Chikhi, and Paul Medvedev. Disk Compression of k-mer Sets. In 20th International Workshop on Algorithms in Bioinformatics (WABI 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 172, pp. 16:1-16:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. General purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm UST-Compress that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from the representation of UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent, across a breadth of datasets. We also derive lower bounds on how well this type of compression strategy can hope to do.

Subject Classification

ACM Subject Classification
  • Applied computing → Computational biology
  • de Bruijn graphs
  • compression
  • k-mer sets
  • spectrum-preserving string sets


  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    PDF Downloads


  1. URL: https://github.com/cosmo-team/cosmo/tree/VARI.
  2. URL: https://github.com/prophyle/prophasm.
  3. Jørgen Bang-Jensen and Gregory Z Gutin. Digraphs: theory, algorithms and applications. Springer Science & Business Media, 2008. Google Scholar
  4. Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A Gurevich, Mikhail Dvorkin, Alexander S Kulikov, Valery M Lesin, Sergey I Nikolenko, Son Pham, Andrey D Prjibelski, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational biology, 19(5):455-477, 2012. Google Scholar
  5. Timo Bingmann, Phelim Bradley, Florian Gauger, and Zamin Iqbal. COBS: a compact bit-sliced signature index. arXiv preprint arXiv:1905.09624, 2019. Google Scholar
  6. Alexander Bowe, Taku Onodera, Kunihiko Sadakane, and Tetsuo Shibuya. Succinct de bruijn graphs. In Proceedings of the 12th International Conference on Algorithms in Bioinformatics, volume 7534 of LNCS, page 225–235. Springer, 2012. Google Scholar
  7. Phelim Bradley, Henk C den Bakker, Eduardo PC Rocha, Gil McVean, and Zamin Iqbal. Ultrafast search of all deposited bacterial and viral genomic data. Nature biotechnology, 37(2):152, 2019. Google Scholar
  8. Karel Břinda, Michael Baym, and Gregory Kucherov. Simplitigs as an efficient and scalable representation of de Bruijn graphs. bioRxiv, 2020. URL: https://doi.org/10.1101/2020.01.12.903443.
  9. Karel Břinda. Novel computational techniques for mapping and classifying Next-Generation Sequencing data. PhD thesis, Université Paris-Est, November 2016. URL: https://doi.org/10.5281/zenodo.1045317.
  10. Rayan Chikhi, Jan Holub, and Paul Medvedev. Data structures to represent sets of k-long DNA sequences. arXiv:1903.12312 [cs, q-bio], 2019. Google Scholar
  11. Rayan Chikhi, Antoine Limasset, Shaun Jackman, Jared T Simpson, and Paul Medvedev. On the representation of de Bruijn graphs. In Research in Computational Molecular Biology, RECOMB 2014, volume 8394 of Lecture Notes in Computer Science, pages 35-55. Springer, 2014. Google Scholar
  12. Rayan Chikhi, Antoine Limasset, and Paul Medvedev. Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics, 32(12):i201-i208, 2016. Google Scholar
  13. Temesgen Hailemariam Dadi, Enrico Siragusa, Vitor C Piro, Andreas Andrusch, Enrico Seiler, Bernhard Y Renard, and Knut Reinert. DREAM-Yara: An exact read mapper for very large databases with short update time. Bioinformatics, 34(17):i766-i772, 2018. Google Scholar
  14. Luca Denti, Marco Previtali, Giulia Bernardini, Alexander Schönhuth, and Paola Bonizzoni. MALVA: genotyping by Mapping-free ALlele detection of known VAriants. iScience, 18:20-27, 2019. Google Scholar
  15. R. S. Harris and P. Medvedev. Improved Representation of Sequence Bloom Trees. Bioinformatics, 36(3):721-727, 2020. Google Scholar
  16. Mikel Hernaez, Dmitri Pavlichin, Tsachy Weissman, and Idoia Ochoa. Genomic Data Compression. Annual Review of Biomedical Data Science, 2, 2019. Google Scholar
  17. Morteza Hosseini, Diogo Pratas, and Armando Pinho. A survey on data compression methods for biological sequences. Information, 7(4):56, 2016. Google Scholar
  18. Costas S Iliopoulos, Ritu Kundu, and Solon P Pissis. Efficient pattern matching in elastic-degenerate texts. In International Conference on Language and Automata Theory and Applications, pages 131-142. Springer, 2017. Google Scholar
  19. Marek Kokot, Maciej Długosz, and Sebastian Deorowicz. KMC 3: counting and manipulating k-mer statistics. Bioinformatics, 33(17):2759-2761, 2017. Google Scholar
  20. Kirill Kryukov, Mahoko Takahashi Ueda, So Nakagawa, and Tadashi Imanishi. Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences. bioRxiv, page 501130, 2018. Google Scholar
  21. Guillaume Marçais and Carl Kingsford. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27(6):764-770, 2011. Google Scholar
  22. Camille Marchet, Christina Boucher, Simon J Puglisi, Paul Medvedev, Mikaël Salson, and Rayan Chikhi. Data structures based on k-mers for querying large collections of sequencing datasets. bioRxiv, page 866756, 2019. Google Scholar
  23. Camille Marchet, Zamin Iqbal, Daniel Gautheret, Mikaël Salson, and Rayan Chikhi. Reindeer: efficient indexing of k-mer presence and abundance in sequencing datasets. bioRxiv, 2020. Google Scholar
  24. Ibrahim Numanagić, James K Bonfield, Faraz Hach, Jan Voges, Jörn Ostermann, Claudio Alberti, Marco Mattavelli, and S Cenk Sahinalp. Comparison of high-throughput sequencing data compression tools. Nature methods, 13(12):1005, 2016. Google Scholar
  25. Brian D Ondov, Todd J Treangen, Páll Melsted, Adam B Mallonee, Nicholas H Bergman, Sergey Koren, and Adam M Phillippy. Mash: fast genome and metagenome distance estimation using MinHash. Genome biology, 17(1):132, 2016. Google Scholar
  26. Prashant Pandey, Fatemeh Almodaresi, Michael A Bender, Michael Ferdman, Rob Johnson, and Rob Patro. Mantis: A fast, small, and exact large-scale sequence-search index. Cell systems, 7(2):201-207, 2018. Google Scholar
  27. Prashant Pandey, Michael A Bender, Rob Johnson, and Rob Patro. Squeakr: an exact and approximate k-mer counting system. Bioinformatics, 34(4):568-575, 2017. Google Scholar
  28. Armando J Pinho and Diogo Pratas. MFCompress: a compression tool for FASTA and multi-FASTA data. Bioinformatics, 30(1):117-118, 2013. Google Scholar
  29. Amatur Rahman and Paul Medvedev. Representation of k-mer sets using spectrum-preserving string sets. In Research in Computational Molecular Biology - 24th Annual International Conference, RECOMB 2020, Padua, Italy, May 10-13, 2020, Proceedings, volume 12074 of Lecture Notes in Computer Science, pages 152-168. Springer, 2020. Google Scholar
  30. Guillaume Rizk, Dominique Lavenier, and Rayan Chikhi. DSK: k-mer counting with very low memory usage. Bioinformatics, 29(5):652-653, 2013. Google Scholar
  31. Steven L Salzberg, Adam M Phillippy, Aleksey Zimin, Daniela Puiu, Tanja Magoc, Sergey Koren, Todd J Treangen, Michael C Schatz, Arthur L Delcher, Michael Roberts, et al. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome research, 22(3):557-567, 2012. Google Scholar
  32. B. Solomon and C. Kingsford. Fast search of thousands of short-read sequencing experiments. Nature biotechnology, 34(3):300-302, 2016. Google Scholar
  33. B. Solomon and C. Kingsford. Improved Search of Large Transcriptomic Sequencing Databases Using Split Sequence Bloom Trees. Journal of Computational Biology, 25(7):755-765, 2018. Google Scholar
  34. Daniel S Standage, C Titus Brown, and Fereydoun Hormozdiari. Kevlar: a mapping-free framework for accurate discovery of de novo variants. iScience, 18:28-36, 2019. Google Scholar
  35. Chen Sun, Robert S. Harris, Rayan Chikhi, and Paul Medvedev. AllSome Sequence Bloom Trees. In Research in Computational Molecular Biology - 21st Annual International Conference, RECOMB 2017, Hong Kong, China, May 3-7, 2017, Proceedings, volume 10229 of Lecture Notes in Computer Science, pages 272-286, 2017. Google Scholar
  36. Chen Sun and Paul Medvedev. Toward fast and accurate snp genotyping from whole genome sequencing data for bedside diagnostics. Bioinformatics, 35(3):415-420, 2018. Google Scholar
  37. Isaac Turner, Kiran V Garimella, Zamin Iqbal, and Gil McVean. Integrating long-range connectivity information into de bruijn graphs. Bioinformatics, 34(15):2556-2565, 2018. Google Scholar
  38. Derrick E Wood and Steven L Salzberg. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome biology, 15(3):R46, 2014. Google Scholar