Motivation. Given a string S, a minimizer scheme is an algorithm defined by a triple (k,w,𝒪) that samples a subset of k-mers (k-long substrings) from a string S. Specifically, it samples the smallest k-mer according to the order 𝒪 from each window of w consecutive k-mers in S. Because consecutive windows can sample the same k-mer, the set of the sampled k-mers is typically much smaller than S. More generally, we consider substring sampling algorithms that respect a window guarantee: at least one k-mer must be sampled from every window of w consecutive k-mers. As a sampled k-mer is uniquely identified by its absolute position in S, we can define the density of a sampling algorithm as the fraction of distinct sampled positions. Good methods have low density which, by respecting the window guarantee, is lower bounded by 1/w. It is however difficult to design a sequence-agnostic algorithm with provably optimal density. In practice, the order 𝒪 is usually implemented using a pseudo-random hash function to obtain the so-called random minimizer. This scheme is simple to implement, very fast to compute even in streaming fashion, and easy to analyze. However, its density is almost a factor of 2 away from the lower bound for large windows. Methods. In this work we introduce mod-sampling, a two-step sampling algorithm to obtain new minimizer schemes. Given a (small) parameter t, the mod-sampling algorithm finds the position p of the smallest t-mer in a window. It then samples the k-mer at position pod w. The lr-minimizer uses t = k-w and the mod-minimizer uses t≡ k (mod w). Results. These new schemes have provably lower density than random minimizers and other schemes when k is large compared to w, while being as fast to compute. Importantly, the mod-minimizer achieves optimal density when k → ∞. Although the mod-minimizer is not the first method to achieve optimal density for large k, its proof of optimality is simpler than previous work. We provide pseudocode for a number of other methods and compare to them. In practice, the mod-minimizer has considerably lower density than the random minimizer and other state-of-the-art methods, like closed syncmers and miniception, when k > w. We plugged the mod-minimizer into SSHash, a k-mer dictionary based on minimizers. For default parameters (w,k) = (11,21), space usage decreases by 15% when indexing the whole human genome (GRCh38), while maintaining its fast query time.
@InProceedings{grootkoerkamp_et_al:LIPIcs.WABI.2024.11, author = {Groot Koerkamp, Ragnar and Pibiri, Giulio Ermanno}, title = {{The mod-minimizer: A Simple and Efficient Sampling Algorithm for Long k-mers}}, booktitle = {24th International Workshop on Algorithms in Bioinformatics (WABI 2024)}, pages = {11:1--11:23}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-340-9}, ISSN = {1868-8969}, year = {2024}, volume = {312}, editor = {Pissis, Solon P. and Sung, Wing-Kin}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2024.11}, URN = {urn:nbn:de:0030-drops-206552}, doi = {10.4230/LIPIcs.WABI.2024.11}, annote = {Keywords: Minimizers, Randomized algorithms, Sketching, Hashing} }
Feedback for Dagstuhl Publishing