eng
Schloss Dagstuhl β Leibniz-Zentrum fΓΌr Informatik
Leibniz International Proceedings in Informatics
1868-8969
2024-08-26
11:1
11:23
10.4230/LIPIcs.WABI.2024.11
article
The mod-minimizer: A Simple and Efficient Sampling Algorithm for Long k-mers
Groot Koerkamp, Ragnar
1
https://orcid.org/0000-0002-2091-1237
Pibiri, Giulio Ermanno
2
3
https://orcid.org/0000-0003-0724-7092
ETH Zurich, Switzerland
Ca' Foscari University of Venice, Italy
ISTI-CNR, Pisa, Italy
Motivation. Given a string S, a minimizer scheme is an algorithm defined by a triple (k,w,πͺ) that samples a subset of k-mers (k-long substrings) from a string S. Specifically, it samples the smallest k-mer according to the order πͺ from each window of w consecutive k-mers in S. Because consecutive windows can sample the same k-mer, the set of the sampled k-mers is typically much smaller than S. More generally, we consider substring sampling algorithms that respect a window guarantee: at least one k-mer must be sampled from every window of w consecutive k-mers. As a sampled k-mer is uniquely identified by its absolute position in S, we can define the density of a sampling algorithm as the fraction of distinct sampled positions. Good methods have low density which, by respecting the window guarantee, is lower bounded by 1/w. It is however difficult to design a sequence-agnostic algorithm with provably optimal density. In practice, the order πͺ is usually implemented using a pseudo-random hash function to obtain the so-called random minimizer. This scheme is simple to implement, very fast to compute even in streaming fashion, and easy to analyze. However, its density is almost a factor of 2 away from the lower bound for large windows.
Methods. In this work we introduce mod-sampling, a two-step sampling algorithm to obtain new minimizer schemes. Given a (small) parameter t, the mod-sampling algorithm finds the position p of the smallest t-mer in a window. It then samples the k-mer at position pod w. The lr-minimizer uses t = k-w and the mod-minimizer uses tβ‘ k (mod w).
Results. These new schemes have provably lower density than random minimizers and other schemes when k is large compared to w, while being as fast to compute. Importantly, the mod-minimizer achieves optimal density when k β β. Although the mod-minimizer is not the first method to achieve optimal density for large k, its proof of optimality is simpler than previous work.
We provide pseudocode for a number of other methods and compare to them. In practice, the mod-minimizer has considerably lower density than the random minimizer and other state-of-the-art methods, like closed syncmers and miniception, when k > w. We plugged the mod-minimizer into SSHash, a k-mer dictionary based on minimizers. For default parameters (w,k) = (11,21), space usage decreases by 15% when indexing the whole human genome (GRCh38), while maintaining its fast query time.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol312-wabi2024/LIPIcs.WABI.2024.11/LIPIcs.WABI.2024.11.pdf
Minimizers
Randomized algorithms
Sketching
Hashing