On Weighted k-mer Dictionaries

Author Giulio Ermanno Pibiri



PDF
Thumbnail PDF

File

LIPIcs.WABI.2022.9.pdf
  • Filesize: 0.92 MB
  • 20 pages

Document Identifiers

Author Details

Giulio Ermanno Pibiri
  • Ca' Foscari University of Venice, Venice, Italy
  • ISTI-CNR, Pisa, Italy

Cite As Get BibTex

Giulio Ermanno Pibiri. On Weighted k-mer Dictionaries. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 9:1-9:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022) https://doi.org/10.4230/LIPIcs.WABI.2022.9

Abstract

We consider the problem of representing a set of k-mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a k-mer is efficient. The representation is called a weighted dictionary of k-mers and finds application in numerous tasks in Bioinformatics that usually count k-mers as a pre-processing step. In fact, k-mer counting tools produce very large outputs that may result in a severe bottleneck for subsequent processing.
In this work we extend the recently introduced SSHash dictionary (Pibiri, Bioinformatics 2022) to also store compactly the weights of the k-mers. From a technical perspective, we exploit the order of the k-mers represented in SSHash to encode runs of weights, hence allowing (several times) better compression than the empirical entropy of the weights. We also study the problem of reducing the number of runs in the weights to improve compression even further and illustrate a lower bound for this problem. We propose an efficient, greedy, algorithm to reduce the number of runs and show empirically that it performs well, i.e., very similarly to the lower bound. Lastly, we corroborate our findings with experiments on real-world datasets and comparison with competitive alternatives. Up to date, SSHash is the only k-mer dictionary that is exact, weighted, associative, fast, and small.

Subject Classification

ACM Subject Classification
  • Applied computing → Bioinformatics
Keywords
  • K-Mers
  • Weights
  • Compression
  • Hashing

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Fatemeh Almodaresi, Hirak Sarkar, Avi Srivastava, and Rob Patro. A space and time-efficient index for the compacted colored de Bruijn graph. Bioinformatics, 34(13):i169-i177, 2018. Google Scholar
  2. Uwe Baier, Timo Beller, and Enno Ohlebusch. Graphical pan-genome analysis with compressed suffix trees and the burrows-wheeler transform. Bioinformatics, 32(4):497-504, 2016. Google Scholar
  3. Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A Gurevich, Mikhail Dvorkin, Alexander S Kulikov, Valery M Lesin, Sergey I Nikolenko, Son Pham, Andrey D Prjibelski, et al. Spades: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational biology, 19(5):455-477, 2012. Google Scholar
  4. Alexander Bowe, Taku Onodera, Kunihiko Sadakane, and Tetsuo Shibuya. Succinct de Bruijn graphs. In International Workshop on Algorithms in Bioinformatics (WABI), pages 225-235. Springer, 2012. Google Scholar
  5. Michael Burrows and David Wheeler. A block-sorting lossless data compression algorithm. In Digital SRC Research Report. Citeseer, 1994. Google Scholar
  6. Rayan Chikhi, Antoine Limasset, Shaun Jackman, Jared T Simpson, and Paul Medvedev. On the representation of de Bruijn graphs. In International conference on Research in computational molecular biology, pages 35-55. Springer, 2014. URL: https://github.com/jts/dbgfm.
  7. Rayan Chikhi, Antoine Limasset, and Paul Medvedev. Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics, 32(12):i201-i208, 2016. URL: https://github.com/GATB/bcalm.
  8. Sebastian Deorowicz, Marek Kokot, Szymon Grabowski, and Agnieszka Debudaj-Grabysz. Kmc 2: fast and resource-frugal k-mer counting. Bioinformatics, 31(10):1569-1576, 2015. Google Scholar
  9. Peter Elias. Efficient storage and retrieval by content and address of static files. Journal of the ACM, 21(2):246-260, 1974. Google Scholar
  10. Robert Mario Fano. On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, MIT, 1971. Google Scholar
  11. Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proceedings 41st Annual Symposium on Foundations of Computer Science, pages 390-398. IEEE, 2000. Google Scholar
  12. Giuseppe Italiano, Nicola Prezza, Blerina Sinaimeri, and Rossano Venturini. Compressed weighted de Bruijn graphs. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021), volume 191, pages 16:1-16:16, 2021. URL: https://github.com/nicolaprezza/cw-dBg.
  13. Shaun D Jackman, Benjamin P Vandervalk, Hamid Mohamadi, Justin Chu, Sarah Yeo, S Austin Hammond, Golnaz Jahesh, Hamza Khan, Lauren Coombe, Rene L Warren, et al. Abyss 2.0: resource-efficient assembly of large genomes using a bloom filter. Genome research, 27(5):768-777, 2017. Google Scholar
  14. Mikhail Karasikov, Harun Mustafa, Gunnar Rätsch, and André Kahles. Lossless indexing with counting de bruijn graphs. bioRxiv, 2021. Google Scholar
  15. Parsoa Khorsand and Fereydoun Hormozdiari. Nebula: ultra-efficient mapping-free structural variant genotyper. Nucleic acids research, 49(8):e47-e47, 2021. Google Scholar
  16. Danyang Ma, Simon J Puglisi, Rajeev Raman, and Bella Zhukova. On elias-fano for rank queries in fm-indexes. In 2021 Data Compression Conference (DCC), pages 223-232. IEEE, 2021. Google Scholar
  17. Guillaume Marçais and Carl Kingsford. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27(6):764-770, 2011. Google Scholar
  18. Camille Marchet, Zamin Iqbal, Daniel Gautheret, Mikaël Salson, and Rayan Chikhi. Reindeer: efficient indexing of k-mer presence and abundance in sequencing datasets. Bioinformatics, 36(Supplement_1):i177-i185, 2020. Google Scholar
  19. Shoshana Marcus, Hayan Lee, and Michael C Schatz. Splitmem: a graphical algorithm for pan-genome analysis with suffix skips. Bioinformatics, 30(24):3476-3483, 2014. Google Scholar
  20. Giuseppe Ottaviano and Rossano Venturini. Partitioned elias-fano indexes. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, pages 273-282, 2014. Google Scholar
  21. Prashant Pandey, Michael A Bender, Rob Johnson, and Rob Patro. A general-purpose counting filter: Making every bit count. In Proceedings of the 2017 ACM international conference on Management of Data, pages 775-787, 2017. Google Scholar
  22. Prashant Pandey, Michael A Bender, Rob Johnson, and Rob Patro. deBGR: an efficient and near-exact representation of the weighted de Bruijn graph. Bioinformatics, 33(14):i133-i141, 2017. Google Scholar
  23. Prashant Pandey, Michael A Bender, Rob Johnson, and Rob Patro. Squeakr: an exact and approximate k-mer counting system. Bioinformatics, 34(4):568-575, 2018. Google Scholar
  24. Raffaele Perego, Giulio Ermanno Pibiri, and Rossano Venturini. Compressed indexes for fast search of semantic data. IEEE Trans. Knowl. Data Eng., 33(9):3187-3198, 2021. Google Scholar
  25. Giulio Ermanno Pibiri. Sparse and skew hashing of k-mers. Bioinformatics, 38(Supplement_1):i185-i194, June 2022. URL: https://doi.org/10.1093/bioinformatics/btac245.
  26. Giulio Ermanno Pibiri and Roberto Trani. Parallel and external-memory construction of minimal perfect hash functions with PTHash. CoRR, abs/2106.02350, 2021. URL: http://arxiv.org/abs/2106.02350.
  27. Giulio Ermanno Pibiri and Roberto Trani. PTHash: Revisiting FCH minimal perfect hashing. In SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, pages 1339-1348, 2021. Google Scholar
  28. Giulio Ermanno Pibiri and Rossano Venturini. Clustered Elias-Fano indexes. ACM Trans. Inf. Syst., 36(1):2:1-2:33, 2017. Google Scholar
  29. Giulio Ermanno Pibiri and Rossano Venturini. Efficient data structures for massive n-gram datasets. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 615-624, 2017. Google Scholar
  30. Giulio Ermanno Pibiri and Rossano Venturini. Handling massive N-gram datasets efficiently. ACM Trans. Inf. Syst., 37(2):25:1-25:41, 2019. Google Scholar
  31. Giulio Ermanno Pibiri and Rossano Venturini. On optimally partitioning variable-byte codes. IEEE Trans. Knowl. Data Eng., 32(9):1812-1823, 2020. Google Scholar
  32. Giulio Ermanno Pibiri and Rossano Venturini. Techniques for inverted index compression. ACM Comput. Surv., 53(6):125:1-125:36, 2021. Google Scholar
  33. Amatur Rahman and Paul Medvedev. Representation of k-mer sets using spectrum-preserving string sets. In International Conference on Research in Computational Molecular Biology, pages 152-168. Springer, 2020. URL: https://github.com/medvedevgroup/UST.
  34. Guillaume Rizk, Dominique Lavenier, and Rayan Chikhi. Dsk: k-mer counting with very low memory usage. Bioinformatics, 29(5):652-653, 2013. Google Scholar
  35. Michael Roberts, Wayne Hayes, Brian R Hunt, Stephen M Mount, and James A Yorke. Reducing storage requirements for biological sequence comparison. Bioinformatics, 20(18):3363-3369, 2004. Google Scholar
  36. Mirko Rossi, Mickael Santos Da Silva, Bruno Filipe Ribeiro-Gonçalves, Diogo Nuno Silva, Miguel Paulo Machado, Mónica Oleastro, Vítor Borges, Joana Isidro, Luis Viera, Jani Halkilahti, Anniina Jaakkonen, Federica Palma, Saara Salmenlinna, Marjaana Hakkinen, Javier Garaizar, Joseba Bikandi, Friederike Hilbert, and João André Carriço. INNUENDO whole genome and core genome MLST schemas and datasets for Salmonella enterica, July 2018. URL: https://doi.org/10.5281/zenodo.1323684.
  37. Kristoffer Sahlin. Effective sequence similarity detection with strobemers. Genome research, 31(11):2080-2094, 2021. Google Scholar
  38. Kristoffer Sahlin. Strobemers: an alternative to k-mers for sequence comparison. bioRxiv, 2021. Google Scholar
  39. Yoshihiro Shibuya, Djamal Belazzougui, and Gregory Kucherov. Set-min sketch: a probabilistic map for power-law distributions with application to k-mer annotation. Journal of Computational Biology, 29(2):140-154, 2022. Google Scholar
  40. Yoshihiro Shibuya, Djamal Belazzougui, and Gregory Kucherov. Space-efficient representation of genomic k-mer count tables. Algorithms for Molecular Biology, 17(1):1-15, 2022. URL: https://github.com/yhhshb/locom.
  41. Daniel S Standage, C Titus Brown, and Fereydoun Hormozdiari. Kevlar: a mapping-free framework for accurate discovery of de novo variants. Iscience, 18:28-36, 2019. Google Scholar
  42. Sebastiano Vigna. Quasi-succinct indices. In Proceedings of the sixth ACM international conference on Web search and data mining, pages 83-92, 2013. Google Scholar
  43. Derrick E Wood and Steven L Salzberg. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome biology, 15(3):1-12, 2014. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail