The mod-minimizer: A Simple and Efficient Sampling Algorithm for Long k-mers

Authors: Ragnar Groot Koerkamp, Giulio Ermanno Pibiri

Author Details

Ragnar Groot Koerkamp
  • ETH Zurich, Switzerland
Giulio Ermanno Pibiri
  • Ca' Foscari University of Venice, Italy
  • ISTI-CNR, Pisa, Italy

Cite As

Ragnar Groot Koerkamp and Giulio Ermanno Pibiri. The mod-minimizer: A Simple and Efficient Sampling Algorithm for Long k-mers. In 24th International Workshop on Algorithms in Bioinformatics (WABI 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 312, pp. 11:1-11:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.WABI.2024.11

Abstract

Motivation. Given a string S, a minimizer scheme is an algorithm defined by a triple (k,w,𝒪) that samples a subset of k-mers (k-long substrings) from S. Specifically, it samples the smallest k-mer according to the order 𝒪 from each window of w consecutive k-mers in S. Because consecutive windows can sample the same k-mer, the set of sampled k-mers is typically much smaller than S. More generally, we consider substring sampling algorithms that respect a window guarantee: at least one k-mer must be sampled from every window of w consecutive k-mers. As a sampled k-mer is uniquely identified by its absolute position in S, we can define the density of a sampling algorithm as the fraction of distinct sampled positions. Good methods have low density which, because of the window guarantee, is lower bounded by 1/w. It is however difficult to design a sequence-agnostic algorithm with provably optimal density. In practice, the order 𝒪 is usually implemented using a pseudo-random hash function to obtain the so-called random minimizer. This scheme is simple to implement, very fast to compute even in streaming fashion, and easy to analyze. However, its density is almost a factor of 2 away from the lower bound for large windows.

Methods. In this work we introduce mod-sampling, a two-step sampling algorithm to obtain new minimizer schemes. Given a (small) parameter t, the mod-sampling algorithm finds the position p of the smallest t-mer in a window. It then samples the k-mer at position p mod w. The lr-minimizer uses t = k-w and the mod-minimizer uses t ≡ k (mod w).

Results. These new schemes have provably lower density than random minimizers and other schemes when k is large compared to w, while being as fast to compute. Importantly, the mod-minimizer achieves optimal density when k → ∞. Although the mod-minimizer is not the first method to achieve optimal density for large k, its proof of optimality is simpler than previous work. We provide pseudocode for a number of other methods and compare to them. In practice, the mod-minimizer has considerably lower density than the random minimizer and other state-of-the-art methods, like closed syncmers and miniception, when k > w. We plugged the mod-minimizer into SSHash, a k-mer dictionary based on minimizers. For default parameters (w,k) = (11,21), space usage decreases by 15% when indexing the whole human genome (GRCh38), while maintaining its fast query time.
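
The two steps of mod-sampling described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, BLAKE2b stands in for the pseudo-random order, a window is taken to span w+k-1 characters (w consecutive k-mers), and the lower bound r used to pick t for the mod-minimizer is an assumed parameter that the abstract does not specify.

```python
import hashlib
from functools import lru_cache


@lru_cache(maxsize=None)
def hash_order(s: str) -> int:
    """Pseudo-random order on strings (assumption: BLAKE2b stands in for the hash)."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "little")


def mod_sample(window: str, w: int, k: int, t: int) -> int:
    """Offset in [0, w) of the k-mer sampled from one window of w consecutive k-mers."""
    assert len(window) == w + k - 1 and 1 <= t <= k
    # Step 1: position of the smallest t-mer among the w + k - t t-mers
    # in the window (min() keeps the leftmost one on ties).
    p = min(range(w + k - t), key=lambda i: hash_order(window[i:i + t]))
    # Step 2: reduce modulo w so that the result indexes one of the w k-mers.
    return p % w


def mod_minimizer_t(w: int, k: int, r: int = 4) -> int:
    """Smallest t >= r with t ≡ k (mod w). r is a hypothetical lower bound; needs k >= r."""
    return r + (k - r) % w


def density(seq: str, w: int, k: int, t: int) -> float:
    """Fraction of k-mer positions of seq that are sampled by at least one window."""
    window_len = w + k - 1
    sampled = set()
    for start in range(len(seq) - window_len + 1):
        window = seq[start:start + window_len]
        sampled.add(start + mod_sample(window, w, k, t))
    return len(sampled) / (len(seq) - k + 1)
```

In this sketch, calling `density(seq, w, k, k)` (that is, t = k) reduces to the random minimizer, since the smallest k-mer already sits at an offset below w. Passing `t = mod_minimizer_t(w, k)` gives a mod-minimizer-style scheme, and `t = k - w` (valid only when k > w) gives the lr-minimizer choice, so the schemes can be compared on the same sequence.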

Subject Classification

ACM Subject Classification
  • Theory of computation → Sketching and sampling
  • Applied computing → Bioinformatics
Keywords
  • Minimizers
  • Randomized algorithms
  • Sketching
  • Hashing
