On Weighted k-mer Dictionaries

Pibiri, Giulio Ermanno

doi:10.4230/LIPIcs.WABI.2022.9

Abstract

We consider the problem of representing a set of k-mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a k-mer are both efficient. The representation is called a weighted dictionary of k-mers and finds application in numerous Bioinformatics tasks that usually count k-mers as a pre-processing step. In fact, k-mer counting tools produce very large outputs that may become a severe bottleneck for subsequent processing. In this work we extend the recently introduced SSHash dictionary (Pibiri, Bioinformatics 2022) to also compactly store the weights of the k-mers. From a technical perspective, we exploit the order of the k-mers represented in SSHash to encode runs of weights, which allows compression several times better than the empirical entropy of the weights. We also study the problem of reducing the number of runs in the weights to improve compression even further, and illustrate a lower bound for this problem. We propose an efficient greedy algorithm to reduce the number of runs and show empirically that it performs well, i.e., very close to the lower bound. Lastly, we corroborate our findings with experiments on real-world datasets and comparisons with competitive alternatives. To date, SSHash is the only k-mer dictionary that is exact, weighted, associative, fast, and small.
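
As a rough illustration of why encoding runs can beat the empirical entropy of the weights, the following toy sketch collapses consecutive equal weights into (weight, run length) pairs and computes the zeroth-order entropy of the same sequence. It is not the encoding used by SSHash; the function names and the sample weight sequence are hypothetical and chosen only for this example.

```python
import math
from collections import Counter
from itertools import groupby


def run_length_encode(weights):
    """Collapse consecutive equal weights into (weight, run_length) pairs."""
    return [(w, sum(1 for _ in group)) for w, group in groupby(weights)]


def empirical_entropy_bits(weights):
    """Zeroth-order empirical entropy, in bits per weight."""
    n = len(weights)
    counts = Counter(weights)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def weight_at(runs, i):
    """Return the weight at position i (linear scan over runs, for clarity only)."""
    for w, length in runs:
        if i < length:
            return w
        i -= length
    raise IndexError("position out of range")


if __name__ == "__main__":
    # Hypothetical weights of 16 k-mers, listed in the order the dictionary stores them.
    weights = [1] * 6 + [2] * 3 + [1] * 4 + [5] * 3
    runs = run_length_encode(weights)
    print("runs:", runs)
    print("entropy (bits/weight):", empirical_entropy_bits(weights))
    print("weight at position 12:", weight_at(runs, 12))
```

An entropy coder that ignores order pays roughly the entropy for every one of the 16 weights, whereas the run representation stores only 4 pairs. Fewer runs mean fewer pairs, which is why reducing the number of runs improves compression further.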

References

Fatemeh Almodaresi, Hirak Sarkar, Avi Srivastava, and Rob Patro. A space and time-efficient index for the compacted colored de Bruijn graph. Bioinformatics, 34(13):i169-i177, 2018.
Uwe Baier, Timo Beller, and Enno Ohlebusch. Graphical pan-genome analysis with compressed suffix trees and the Burrows-Wheeler transform. Bioinformatics, 32(4):497-504, 2016.
Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A Gurevich, Mikhail Dvorkin, Alexander S Kulikov, Valery M Lesin, Sergey I Nikolenko, Son Pham, Andrey D Prjibelski, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology, 19(5):455-477, 2012.
Alexander Bowe, Taku Onodera, Kunihiko Sadakane, and Tetsuo Shibuya. Succinct de Bruijn graphs. In International Workshop on Algorithms in Bioinformatics (WABI), pages 225-235. Springer, 2012.
Michael Burrows and David Wheeler. A block-sorting lossless data compression algorithm. In Digital SRC Research Report. Citeseer, 1994.
Rayan Chikhi, Antoine Limasset, Shaun Jackman, Jared T Simpson, and Paul Medvedev. On the representation of de Bruijn graphs. In International conference on Research in computational molecular biology, pages 35-55. Springer, 2014. URL: https://github.com/jts/dbgfm.
Rayan Chikhi, Antoine Limasset, and Paul Medvedev. Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics, 32(12):i201-i208, 2016. URL: https://github.com/GATB/bcalm.
Sebastian Deorowicz, Marek Kokot, Szymon Grabowski, and Agnieszka Debudaj-Grabysz. KMC 2: fast and resource-frugal k-mer counting. Bioinformatics, 31(10):1569-1576, 2015.
Peter Elias. Efficient storage and retrieval by content and address of static files. Journal of the ACM, 21(2):246-260, 1974.
Robert Mario Fano. On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, MIT, 1971.
Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proceedings 41st Annual Symposium on Foundations of Computer Science, pages 390-398. IEEE, 2000.
Giuseppe Italiano, Nicola Prezza, Blerina Sinaimeri, and Rossano Venturini. Compressed weighted de Bruijn graphs. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021), volume 191, pages 16:1-16:16, 2021. URL: https://github.com/nicolaprezza/cw-dBg.
Shaun D Jackman, Benjamin P Vandervalk, Hamid Mohamadi, Justin Chu, Sarah Yeo, S Austin Hammond, Golnaz Jahesh, Hamza Khan, Lauren Coombe, Rene L Warren, et al. ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter. Genome Research, 27(5):768-777, 2017.
Mikhail Karasikov, Harun Mustafa, Gunnar Rätsch, and André Kahles. Lossless indexing with counting de Bruijn graphs. bioRxiv, 2021.
Parsoa Khorsand and Fereydoun Hormozdiari. Nebula: ultra-efficient mapping-free structural variant genotyper. Nucleic acids research, 49(8):e47-e47, 2021.
Danyang Ma, Simon J Puglisi, Rajeev Raman, and Bella Zhukova. On Elias-Fano for rank queries in FM-indexes. In 2021 Data Compression Conference (DCC), pages 223-232. IEEE, 2021.
Guillaume Marçais and Carl Kingsford. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27(6):764-770, 2011.
Camille Marchet, Zamin Iqbal, Daniel Gautheret, Mikaël Salson, and Rayan Chikhi. REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets. Bioinformatics, 36(Supplement_1):i177-i185, 2020.
Shoshana Marcus, Hayan Lee, and Michael C Schatz. SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips. Bioinformatics, 30(24):3476-3483, 2014.
Giuseppe Ottaviano and Rossano Venturini. Partitioned Elias-Fano indexes. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, pages 273-282, 2014.
Prashant Pandey, Michael A Bender, Rob Johnson, and Rob Patro. A general-purpose counting filter: Making every bit count. In Proceedings of the 2017 ACM international conference on Management of Data, pages 775-787, 2017.
Prashant Pandey, Michael A Bender, Rob Johnson, and Rob Patro. deBGR: an efficient and near-exact representation of the weighted de Bruijn graph. Bioinformatics, 33(14):i133-i141, 2017.
Prashant Pandey, Michael A Bender, Rob Johnson, and Rob Patro. Squeakr: an exact and approximate k-mer counting system. Bioinformatics, 34(4):568-575, 2018.
Raffaele Perego, Giulio Ermanno Pibiri, and Rossano Venturini. Compressed indexes for fast search of semantic data. IEEE Trans. Knowl. Data Eng., 33(9):3187-3198, 2021.
Giulio Ermanno Pibiri. Sparse and skew hashing of k-mers. Bioinformatics, 38(Supplement_1):i185-i194, June 2022. URL: https://doi.org/10.1093/bioinformatics/btac245.
Giulio Ermanno Pibiri and Roberto Trani. Parallel and external-memory construction of minimal perfect hash functions with PTHash. CoRR, abs/2106.02350, 2021. URL: http://arxiv.org/abs/2106.02350.
Giulio Ermanno Pibiri and Roberto Trani. PTHash: Revisiting FCH minimal perfect hashing. In SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, pages 1339-1348, 2021.
Giulio Ermanno Pibiri and Rossano Venturini. Clustered Elias-Fano indexes. ACM Trans. Inf. Syst., 36(1):2:1-2:33, 2017.
Giulio Ermanno Pibiri and Rossano Venturini. Efficient data structures for massive n-gram datasets. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 615-624, 2017.
Giulio Ermanno Pibiri and Rossano Venturini. Handling massive N-gram datasets efficiently. ACM Trans. Inf. Syst., 37(2):25:1-25:41, 2019.
Giulio Ermanno Pibiri and Rossano Venturini. On optimally partitioning variable-byte codes. IEEE Trans. Knowl. Data Eng., 32(9):1812-1823, 2020.
Giulio Ermanno Pibiri and Rossano Venturini. Techniques for inverted index compression. ACM Comput. Surv., 53(6):125:1-125:36, 2021.
Amatur Rahman and Paul Medvedev. Representation of k-mer sets using spectrum-preserving string sets. In International Conference on Research in Computational Molecular Biology, pages 152-168. Springer, 2020. URL: https://github.com/medvedevgroup/UST.
Guillaume Rizk, Dominique Lavenier, and Rayan Chikhi. DSK: k-mer counting with very low memory usage. Bioinformatics, 29(5):652-653, 2013.
Michael Roberts, Wayne Hayes, Brian R Hunt, Stephen M Mount, and James A Yorke. Reducing storage requirements for biological sequence comparison. Bioinformatics, 20(18):3363-3369, 2004.
Mirko Rossi, Mickael Santos Da Silva, Bruno Filipe Ribeiro-Gonçalves, Diogo Nuno Silva, Miguel Paulo Machado, Mónica Oleastro, Vítor Borges, Joana Isidro, Luis Viera, Jani Halkilahti, Anniina Jaakkonen, Federica Palma, Saara Salmenlinna, Marjaana Hakkinen, Javier Garaizar, Joseba Bikandi, Friederike Hilbert, and João André Carriço. INNUENDO whole genome and core genome MLST schemas and datasets for Salmonella enterica, July 2018. URL: https://doi.org/10.5281/zenodo.1323684.
Kristoffer Sahlin. Effective sequence similarity detection with strobemers. Genome research, 31(11):2080-2094, 2021.
Kristoffer Sahlin. Strobemers: an alternative to k-mers for sequence comparison. bioRxiv, 2021.
Yoshihiro Shibuya, Djamal Belazzougui, and Gregory Kucherov. Set-min sketch: a probabilistic map for power-law distributions with application to k-mer annotation. Journal of Computational Biology, 29(2):140-154, 2022.
Yoshihiro Shibuya, Djamal Belazzougui, and Gregory Kucherov. Space-efficient representation of genomic k-mer count tables. Algorithms for Molecular Biology, 17(1):1-15, 2022. URL: https://github.com/yhhshb/locom.
Daniel S Standage, C Titus Brown, and Fereydoun Hormozdiari. Kevlar: a mapping-free framework for accurate discovery of de novo variants. iScience, 18:28-36, 2019.
Sebastiano Vigna. Quasi-succinct indices. In Proceedings of the sixth ACM international conference on Web search and data mining, pages 83-92, 2013.
Derrick E Wood and Steven L Salzberg. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome biology, 15(3):1-12, 2014.
