SimdMinimizers: Computing Random Minimizers, fast
Abstract
Motivation.
Because of the rapidly-growing amount of sequencing data, computing sketches of large textual datasets has become an essential preprocessing task. These sketches are typically much smaller than the input sequences, but preserve sufficient information for downstream analysis. Minimizers are an especially popular sketching technique and used in a wide variety of applications. They sample at least one out of every consecutive -mers. As DNA sequencers are getting more accurate, some applications can afford to use a larger and hence sparser and smaller sketches. And as sketches get smaller, their analysis becomes faster, so the time spent sketching the full-sized input becomes more of a bottleneck.
Methods.
Our library simd-minimizers implements a random minimizer algorithm using SIMD instructions. It supports both AVX2 and NEON architectures. Its main novelty is two-fold. First, it splits the input into 8 chunks that are streamed over in parallel through all steps of the algorithm. This is enabled by using the completely deterministic two-stacks sliding window minimum algorithm, which seems not to have been used before for finding minimizers.
Results.
Our library is up to faster than a scalar implementation of the rescan method when is small, and faster for larger . Computing canonical minimizers is less than slower than computing forward minimizers, and over faster than the existing implementation in the minimizer-iter crate. Our library finds all (canonical) minimizers of a 3.2 Gbp human genome in 5.2 (resp. 6.7) seconds.
Keywords and phrases:
Minimizers, Randomized algorithms, Sketching, HashingFunding:
Ragnar Groot Koerkamp: ETH Research Grant ETH-1721-1 to Gunnar Rätsch.Copyright and License:
2012 ACM Subject Classification:
Theory of computation Sketching and sampling ; Applied computing BioinformaticsSupplementary Material:
Software: https://github.com/rust-seq/simd-minimizers [9]archived at
swh:1:dir:d2fbef4c7750f02bc5cb084d0873078fef7f4e22
archived at
swh:1:dir:181f8591389fdf9b941ac6f225a26d2784514787
Acknowledgements:
We thank the anonymous reviewers for their comments that lead to a significantly improved presentation, as well as for their suggestion to apply a matrix transpose to the input, which improved performances.Editors:
Petra Mutzel and Nicola PrezzaSeries and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Minimizers were simultaneously introduced by [31] and [30] as a method to sample short strings of fixed length , called -mers or -grams, for the purpose of fingerprinting and comparing large textual documents such as genomic sequences. This sampling method plays a central role in bioinformatics for the high-throughput analysis of DNA sequencing data and is a fundamental building block for many related tasks such as indexing [29, 21], counting [3, 23], aligning [19, 14], or assembling [4, 1] genomic sequences.
Minimizers are defined as follows: given a window of consecutive -mers, the minimizer of is the smallest -mer according to some order. In practice, this order is often pseudo-random by hashing the -mers, leading to the random minimizer. The density of a minimizer scheme is the expected fraction of sampled -mers on a sufficiently long random string. When is not too small, random minimizers have an expected density close to , which is around twice as much as the lower bound of . In recent years, there have been a number of papers on methods with lower density than random minimizers [35, 28, 11, 7, 6]. While these contributions have narrowed the gap to an optimal-density sampling scheme [17], none of them focused on improving the computation time of minimizers.
In bioinformatics applications, the -mer length is usually at most , so
that each -mer can be represented by a single u64 machine word, and the
window length can be as low as (where around a third of the -mers is
sampled), but is typically between and . Longer windows of size up to
are also possible to index highly conserved texts.
Problem statement.
We aim to solve the following problem as fast as possible: given a bitpacked representation of a sequence of ACGT DNA characters, compute the positions of all (canonical) random minimizers.
Contributions.
This work introduces a carefully optimized algorithm to compute the minimizers of a genomic sequence. Conceptually, it consists of two parts. First, we introduce a scalar algorithm (Section 3) that uses time and space. Most of this can be trivially parallelized to independent lanes with AVX2 or NEON SIMD instructions, and in Section 4 we specifically handle the input and output. The parts we discuss are:
-
1.
We apply ntHash [25, 16], a pseudo-random rolling hash function for -mers (Section 3.2).
-
2.
We compute the sliding-window minima of the hashes using the two-stacks method [12, 32] (Section 3.3).
-
3.
We compute canonical minimizers based on refined minimizers [27] to decide the strand of each window (Section 3.4).
-
4.
We extend the scalar algorithm to SIMD by using it on chunks of the (bitpacked) input sequence in parallel (Section 4.1).
-
5.
Lastly, we collect and deduplicate the parallel streams into unique minimizer positions (Section 4.2).
Results.
Our method is (for large ) to (for small ) times faster than the fastest non-SIMD algorithm for computing forward minimizers. For canonical minimizers, we only compare against a simple implementation and find over speedup. As a result, we can compute the minimizers of a human genome in 5.2 seconds, and the canonical minimizers in 6.7 seconds. We also adapt our method to support generic plain-text ASCII input (), which is slightly (30%) slower due to the larger input characters.
Software.
A Rust implementation of our method is publicly available at https://github.com/rust-seq/simd-minimizers. The packed sequence representation, the splitting into chunks, and the parallel iteration over their bases is extracted to a separate library, packed-seq, available at https://github.com/rust-seq/packed-seq.
2 Preliminaries
Bitpacking.
In this work, we assume that the input sequence is over the DNA alphabet
and that each letter is
encoded using two bits: .
This encoding can easily be obtained from the
ASCII representation by applying a mask: (c >> 1) & 3. Additionally, we
assume that the whole sequence is bitpacked using this 2-bit encoding, which
can be done as a preprocessing step on the input if necessary. Non-ACGT
characters have to be handled during this preprocessing as well, and could be
skipped, converted to A, or the input sequence could be split at these points.
We assume that the hardware is little-endian, and that the integer value of a
sequence is given by .
Minimizers.
Given parameters and , a window of length contains consecutive -mers. The minimizer of the window is the smallest -mer in the window. For random minimizers, -mers are ordered by a pseudo-random order, usually given by comparing hashes of the -mers. In case of ties, the leftmost smallest -mer is chosen.
Our goal is to compute the absolute position of the minimizer of every window in the input text. Since adjacent windows often have the same -mer as minimizer, we only want each position to be listed once in the output.
Canonical minimizers.
Because DNA is double-stranded and most sequencing technologies do not distinguish these two strands, genomic sequences have an additional constraint: a sequence and its reverse-complement (the reversed sequence of complementary bases and ) should be considered identical. To satisfy this constraint, canonical minimizers should return the same set of -mers regardless of the strandedness of the input. Specifically, if the canonical minimizer of a window is at position , then the canonical minimizer of the reverse-complement of should be at position . In practice, canonical minimizers are often overlooked as an implementation detail and most existing methods simply compare canonical -mers, computed as , which gives a weaker guarantee [24].
3 A predictable scalar algorithm
Overview.
Computing the minimizers of a sequence generally involves a few steps, as shown in Figure 1. First, the -mers are hashed. We do this using ntHash, a rolling hash (Section 3.2). Next, in each window of -mer hashes, we must find the position of the leftmost -mer with the smallest hash. For this, we use a method based on the two-stacks algorithm (Section 3.3). Lastly, many adjacent windows will have the same minimizer, and thus these positions must be deduplicated.
In this section, we present a scalar algorithm (that will be trivial to parallelize later by using SIMD instructions Section 4), and we also introduce a variant to compute canonical minimizers in Section 3.4.
The rolling hash takes constant time per character, and our sliding window algorithm will also take amortized constant time per character, so that the entire method runs in . It needs space to store the hashes of the current window. With the SIMD optimizations, this improves to time and space, where is the number of SIMD lanes.
3.1 Iterating Packed Input
As input to our algorithm, we use a 2-bit packed sequence representation. Our representation is little-endian, in that the two least significant bits of each 8-bit or 32-bit value correspond to the leftmost encoded base.
Extracting bases.
To iterate the bases of the input string, we process them one u32 of 16 bases at a time. For each of those, we do 16 iterations where we extract the 2 least significant bits to return, and then shift the remainder down by 2 bits.
For 8-bit ASCII input, instead, the low 8 bits can be extracted and then shifted away.
Delayed iterator.
In order to support a rolling hash, we simply iterate the characters twice in parallel, with an offset: once for the character entering each -mer and once delayed by iterations for the character leaving each -mer. Similarly, for canonical minimizers we will also need the character leaving the window, for which we do a third iteration delayed by steps. To avoid additional reads from memory, we store u32 values read from memory in a small ring buffer.
3.2 Rolling Hash
The first step of the algorithm is the computation of a hash for each -mer. We adapt ntHash [25, 16], a popular rolling hash function for DNA sequences that is based on cyclic polynomials [2] as an improvement over Karp-Rabin hashing [15]. This rolling hash function can be summarized as follows: each of the 4 bases is associated to a fixed 32-bit random value, denoted , and the hash of a -mer is computed as , where denotes a cyclic rotation by bits to the left and denotes xor. In practice, each is a value in and is a simple table lookup, as shown in Algorithm 2.
Using a rolling hash has a twofold advantage in our use case: first, we only need the first and last bases of each -mer to update its hash based on the previous one, so that we do not have to store the -mer itself, and second, we are not limited to -mers that fit in a constant number of words.
Most of the pseudocode in Algorithm 2 can be trivially adapted to
work on u32x8 SIMD registers containing 8 32-bit values.
The table_lookup function computes the
values of of the nucleotides stored in in_bp or out_bp
by looking up their value in table_in or table_out.
It can be implemented using _mm256_permutevar_ps in AVX2 or
vqtbl1q_u8 in NEON.
MulHash.
A drawback of ntHash is that the efficient SIMD table lookup only works because it uses an alphabet of size . This means it does not work for general ASCII input. We introduce an alternative that we call mulHash, which replaces the table lookup of ntHash by , where is a fixed random constant and the multiplication is wrapping over 32-bit integers. Although slightly slower, this can be easily computed for any input value .
3.3 Sliding Window Minimum
The second step of our algorithm computes the position of the minimum -mer hash in each sliding window of hashes. A number of different approaches can be used for this, and pseudocode for each of the methods discussed can be found in Algorithm 3.
Naive.
The simplest approach is to simply loop over the values in each window independently. This takes time, but can still be quite efficient when is small by using vectorized instructions.
Monotone queue.
An approach with better complexity is to use a monotone queue, which stores a non-decreasing subsequence of the hashes, alongside their positions. Every time the window slides one to the right and we are about to push a new -mer hash onto the right of the queue, we first remove any values larger than it, as they are “shadowed” by the new hash and can never be minimal anymore. The minimum of the window is then always the leftmost queue element. This data structure guarantees an amortized constant time update, but has many unpredictable branches due to removing between and values, which makes it costly in practice.
Rescan.
Another approach used in bioinformatics is to only keep track of the minimum value and rescan the entire window of values when the current minimum goes out of scope [19, 20]. While this algorithm does not guarantee a worst-case constant time update, it only branches when the minimum goes out of scope and hence is more predictable. This makes it more efficient in practice, especially since minimizers typically have a density of so that the rescan step takes amortized constant time per element.
To the best of our knowledge, most existing methods in bioinformatics use either a monotone queue or a rescan approach [13, 19, 20].
Two-stacks.
Since our goal here is to compute minima at the same time using vectorized instructions, we want to avoid any kind of data-dependent branches to ensure that the code path is the same for each chunk. A method for online sliding minima where elements may be added and removed at varying rates is the two-stacks method [12, 32], that is well-known in the competitive programming community111https://codeforces.com/blog/entry/71687. Here, we only discuss a version where the number of elements remains constant at .
Conceptually, we first split the sequence of input -mer hashes into blocks of size , as shown in Figure 2. Then, we can compute both prefix minima and suffix minima of each block in per block, or amortized per input hash. Now, any window of size can be split into a suffix of the previous block and a prefix of the current block (as highlighted in Figure 2), and we can return the minimum of the two corresponding suffix/prefix minima. When the window exactly coincides with a block (as shown on the right), the suffix and prefix minimum are equal.
In the implementation, Algorithm 4, the prefix minima are simply computed incrementally, while the suffix-minima are computed in batches after every block of hashes has been filled. This way, only a single buffer of size is needed, and the two cases above are unified.
The only branch in the algorithm triggers exactly every iterations, and is thus completely data-independent. Thus, branches are both highly predictable, and the same across all lanes.
16-bit hashes.
To easily return the leftmost position of the minimum, instead of the minimum itself, we only use the upper 16 bits of each hash value, and store the position in the lower 16 bits. After taking the minimum, we mask out the high bits to obtain its position. Strings longer than characters can be processed in chunks of . We note here that while using a 16-bit hash is usually not sufficient for -mer indexing purposes, this is sufficiently good for selecting the minimum of a window when , which is indeed the case in bioinformatics applications, where usually . Our library intentionally does not expose this hash to the user. Instead, a second, larger, hash can be computed afterwards for indexing purposes if needed.
3.4 Canonical Minimizers
One problem that arises when using minimizers in practice is that the DNA strand is often unknown. Thus, we do not know whether we are reading the forward (sense) or reverse-complement (antisense) strand, where the sequence is reversed and complementary bases and are used. When given a sequence, we would like to select the same minimizers in a strand-agnostic manner.
Canonical strand.
Following [27], we define the canonical strand for each window of length as the strand where the count of GT bases (encoded values and ) is the highest, as shown in Algorithm 5. (In fact, any pair of bases can be chosen, as long as they are not complementary.) This count is in and when is odd there can be no tie between the two strands. In code, we instead compute the more symmetric , which is in , so that count defines the canonical strand. Then, we select the leftmost forward minimizer when the input window is canonical, and the rightmost reverse-complement minimizer when the input window is not canonical.
The benefit of this method over, say, determining the strand via the middle character (assuming again that is odd), is that the GT count is more stable across consecutive windows, since it varies by . This way, the strandedness and thus the chosen minimizer is less likely to flip between adjacent windows.
Canonical ntHash.
NtHash can be easily modified to compute both the forward and reverse-complement hash of each -mer at the same time. Then, we can duplicate the sliding minimum algorithm to find the minimum of and the maximum of over each window. This way, ties in the forward direction are broken towards small and ties in the reverse-complement direction are broken towards large . To avoid selecting a new minimizer whenever the strand changes, we use a strand-independent canonical version of ntHash for both strands, defined as [16], and we do not distinguish which strand a minimizer -mer was chosen from.
While this process does not assume to be odd, this may be a useful additional assumption for further processing of the minimizers, so that the canonical representation of each -mer itself can be indexed. Alternatively, -mers could all be processed in the direction of the strand of the window they minimize, but this requires additional bookkeeping to store this direction, and would require duplicating -mers when they minimize both forward and reverse-complement strands.
Canonical minimizers are not forward.
One small drawback of this scheme is that it is not forward. Suppose a long window has many occurrences of the smallest minimizer, and that shifting the window one position changes its canonical strand. Then, the position of the sampled minimizer could jump backwards: from sampling the rightmost minimizer to sampling the leftmost minimizer. In practice, this does not seem to be a major limitation, both because it is rare and because downstream methods usually work fine on non-forward schemes anyway.
4 A SIMD algorithm
SIMD.
So far, our algorithms for iterating the input bases (Algorithm 1), hashing -mers (Algorithm 2), and computing sliding window minima (Algorithm 4) are completely scalar. Now, we would like to use SIMD instructions to speed them up. With 256-bit AVX2 instructions, for example, we can process lanes of 32-bit values at a time. A first approach could be to use this to compute the hash of consecutive minimizer values at the same time, and then to compute the minimizer of the new windows all at once. Unfortunately, this is tricky due to the sequential nature of rolling hashing and sliding window minima.
Chunks.
Instead of processing consecutive -mers in parallel, we choose to split the input sequence into equally long chunks that we process in parallel. This way, we compute one hash of each chunk in parallel, and then compute one minimizer position of a window of each chunk in parallel as well. This works, because the computations on the chunks are completely independent of each other. We let adjacent chunks overlap by characters, so that each window is fully contained in exactly one chunk. The total number of windows may not be divisible by . In that case, we round up the chunk length and return the number of elements that was added as padding, so that these can be removed later. For simplicity, we omit that case from the code snippets.
4.1 Gathering the input
The parts that need special attention for SIMD instructions are the parallel reading of the input sequence, and the parallel deduplicating of the output minimizer positions (Section 4.2).
The first step is to read the actual bases.
Since memory access instructions are relatively slow222https://uops.info/html-instr/VPGATHERDD_YMM_VSIB_YMM_YMM.html compared to the SIMD
operations we do on them, we cannot afford to read from each chunk one base or one byte at a time.
We could read 32 bits at a time from each chunk using SIMD
gather instructions, but these are relatively slow.
Instead333As suggested by one of the anonymous reviewers, for which we
thank them., we read a full SIMD register of 256 bits worth of data at a time
from each chunk, as shown in Algorithm 6. Then, we transpose444https://stackoverflow.com/questions/25622745/transpose-an-8x8-float-using-avx-avx2555https://github.com/rust-seq/simd-minimizers/blob/master/simd-minimizers/src/intrinsics/transpose.rs
this matrix
so that we obtain
SIMD registers, where the first contains the upcoming u32
for each chunk, the second contains the next u32 for each lane, and so on.
Then, we use this transposed matrix as a buffer, and use it for the next 128 bases. After
each 16 bases, we shift to the next u32x8 from it.
4.2 Collecting and Deduplicating Positions
Transpose.
The last step of the minimizer algorithm is to deduplicate the results, since adjacent windows often share the same minimizer position. In practice, deduplicating works best when the data to be duplicated is linear in memory. But the output of the vectorized sliding-window-minimum gives a u32x8 containing the position of one minimizer of each chunk. Thus, every iterations we reuse the matrix transpose to obtain a u32x8 for each chunk, containing consecutive minimizer positions666https://github.com/rust-seq/simd-minimizers/blob/master/simd-minimizers/src/collect.rs.
Dedup.
We deduplicate each lane using the technique of [18]. This compares each element to the previous one, and compares the first element to the last minimizer of the window before. The distinct elements are then shuffled to the front of the SIMD vector using a lookup table and appended to a buffer for each lane (Figure 4). We end by concatenating all the per-lane buffers into a single vector of minimizer positions, and make sure to avoid duplicates between the end and start of adjacent lanes.
Super-k-mers.
The deduplication can be amended to also find super--mers, which are sequences of consecutive windows sharing the same minimizer position. After comparing adjacent minimizer positions, we obtain a mask that determines the shuffle instruction to apply. Normally we shuffle the 32-bit minimizer positions directly. Instead, we can mask out the upper 16 bits and store there the index of its window. If we then shuffle those values, we obtain for each minimizer its position in the input text, and the position of the first window where this -mer became a minimizer. This information is sufficient to recover all super--mers, and sequences longer than bp can be either split-up, or else it is easy to detect manually when the values wrapped.
| Part | ns/bp |
|---|---|
| Iterate the bases | 0.15 |
| + collect to vector | 0.30 |
| + iterate the delayed bases | 0.30 |
| + ntHash | 0.32 |
| + sliding window min | 0.90 |
| + collect | 1.48 |
| + dedup | 1.61 |
| + canonical nthash | 1.04 |
| + canonical strand | 1.53 |
| + collect | 2.03 |
| + dedup | 2.20 |
| : | ||||||
|---|---|---|---|---|---|---|
| Method | fwd. | cano. | fwd. | cano. | fwd. | cano. |
| minimizer-iter | 25.30 | 32.84 | 26.96 | 33.93 | 26.81 | 34.04 |
| Rescan ntHash | 11.65 | – | 7.41 | – | 5.61 | – |
| simd-minimizers ntHash | ||||||
| – packed input | 1.69 | 2.28 | 1.61 | 2.20 | 1.64 | 2.16 |
| – on-the-fly packing | 1.92 | 2.50 | 1.84 | 2.42 | 1.91 | 2.42 |
| Rescan mulHash | 11.37 | – | 6.79 | – | 5.76 | – |
| simd-minimizers mulHash | ||||||
| – packed input | 1.85 | 2.49 | 1.74 | 2.40 | 1.78 | 2.42 |
| – on-the-fly packing | 2.12 | 2.70 | 2.05 | 2.62 | 2.05 | 2.65 |
| – ASCII input | 2.11 | 2.71 | 2.06 | 2.63 | 2.01 | 2.66 |
5 Experimental Evaluation
Our code is available in the simd-minimizers crate that can be found at https://github.com/rust-seq/simd-minimizers. Part of the code was extracted into a separate library, packed-seq, for easy reuse in other projects, which is available at https://github.com/rust-seq/packed-seq. Both libraries support both AVX2 and NEON instruction sets, and we will now look at their performance. The code to reproduce the experiments is available in the simd-minimizers-bench subdirectory.
The experiments were run on an Intel Core i7-10750H with 6 cores with AVX2 running at a fixed frequency of 2.6GHz with hyperthreading disabled and cache sizes of 32 KiB (L1), 256 KiB (L2), and 12 MiB shared L3. Code is compiled with rustc 1.88.0-nightly777We have made sure that all layers of iterators are inlined into single function. This is usually needed for optimal performance. Even then, small changes to the generated code (by changing the source, or compiler version) can impact performance by as much as 20%.. As input, we use a fixed random string of bases, depending on the method encoded either as ASCII or as packed representation. Reported timings are the median of five runs and shown in nanoseconds per base.
In our experiments, we use parameter values for and as used by Kraken2 [33] , SSHash [29] , and Minimap2 [19] .
Tables 3 and 4 in Appendix A show equivalent results for the NEON architecture.
Incremental time usage.
Table 1 shows the time usage for various incremental subsets of our method. To start, iterating the chunks of the input and summing all bases takes 0.15 ns/bp. Appending all u32x8 SIMD vectors containing the bases to a vector takes 0.30 ns/bp, indicating that writing to memory induces some overhead. Collecting the second -delayed stream of characters that leave the -mer (by adding them to the non-delayed stream) has no additional overhead. Computing ntHash only takes 0.02ns/bp extra. The sliding window computation nearly triples the total time. Collecting the minimizer positions to a linear vector (i.e., transposing matrices and writing output for each of the 8 chunks) again incurs 50% overhead, ad deduplicating them actually again adds some time.
Going back a step, using canonical ntHash instead of forward ntHash takes 0.14 ns/bp extra, and determining the canonical strand (via a third -delayed stream and counting GT bases) takes another 0.49 ns/bp. As before, collecting and deduplicating are slow and add around 0.70ns/bp.
In conclusion, we see that iterating the chunks of the input and determining minimizers is quite fast, but that a lot of time must then be spent to “deinterleave” the output into a linear stream. As can be expected, canonical minimizers are slower to compute than forward minimizers, but the overhead is less than , which seems quite low given that the ntHash and sliding window minimum computation are duplicated and a canonical-strand computation is added.
Full comparison.
We compare against the minimizer-iter crate (v1.2.1) [22], which implements a queue-based sliding window minimum using wyhash [34] and also supports canonical minimizers. For an additional comparison, we optimized an implementation of the remap method with ntHash based on a code snippet by Daniel Liu [20].
Results are in Table 2 and Figure 5. minimizer-iter takes around 26ns/bp for forward and 33ns/bp for canonical minimizers, and its runtime does not depend much on and , because the popping from the queue is unpredictable regardless of . Rescan starts out at 11.7 ns/bp for and gets significantly faster as increases, converging to around 5 ns/bp for . This is explained by the fact that rescan has a branch miss every time the current minimizer falls out of the window, which happens for roughly half the minimizers at a rate of . Thus, as increases, the method becomes more predictable and branch misses go down. Our method, simd-minimizers, runs around 1.61 ns/bp for forward and 2.20 ns/bp for canonical minimizers when given packed input, and therefore is to faster than the rescan method.
In Figure 5, we see that simd-minimizers’ performance is mostly independent of and since it is mostly data-independent. Only for small it is slightly slower due to the larger number of minimizers and hence larger size of the output.
As we use SIMD with lanes, we could in theory expect up to speedup. In practice this is hard to reach because of constant overhead and because the overhead to work well with SIMD in the first place. In particular for large , rescan benefits from very predictable and simple code and only outputs unique minimizer positions, making it very efficient. In SIMD, on the other hand, we use a data-independent algorithm, and output the minimizer position for every single window, which then has to be deduplicated. Thus, it is nice to see that even for large , our method is over faster, despite this overhead.
ASCII input and mulHash.
Apart from taking bit-packed input, simd-minimizers also works on ASCII-encoded DNA sequences of ACTG characters directly, which are then packed into values for ntHash (in that order) during iteration. This is around 0.25ns/bp slower, mostly because of the larger size of the unpacked input.
The mulHash variant is around 0.20ns/bp slower again, but works for any ASCII input. Performance on 100 MB of the Pizza&Chili corpus [5] English888https://pizzachili.dcc.uchile.cl/texts/nlang/ and Sources999https://pizzachili.dcc.uchile.cl/texts/code/ datasets is nearly identical to performance on the random DNA shown in Table 2.
Human genome.
We also run simd-minimizers on the chromosomes of a human genome (T2T-CHM13v2.0101010Available at https://github.com/marbl/CHM13 [26]), of total size 3.2 Gbp. Here, computing forward minimizers takes 5.19 seconds, and canonical minimizers takes 6.71 seconds for , which corresponds 1.67 and 2.15 ns/bp, which is within a few percent of Table 2.
Density.
For , we get a density of 0.173 for forward minimizers, and 0.167 for canonical minimizers, which are both close to the expected density of . Changing to , we get density 0.098 for forward minimizers and 0.100 for canonical minimizers, both close to the expected density of . Thus, we see that in practice, ntHash is a sufficiently random hash function to approach the expected density of random minimizers with a perfectly uniform random hash function.
Multithreading.
We test throughput in a multithreaded setting, by using 6 threads to process the 25 chromosomes (22, X, Y, and mitochondrion) in parallel. This way, processing takes 0.97 s (forward) and 1.27 s (canonical) when the input data is already loaded into memory, showing slightly above speedup. This is just below speedup as the chromosomes don’t perfectly partition the data into 6 equal parts, so that some threads finish before others.
For applications, we recommend to simply call our library in parallel from multiple threads as needed.
6 Conclusions and Future Work
Our library simd-minimizers computes minimizer positions to faster than other methods. Using the library, only a single function call is needed to obtain the list of (canonical) minimizer positions, taking as input the (packed) DNA sequence and the parameters and . General ASCII input are also supported, allowing use cases such as sketching protein sequences. We hope that the community will adopt simd-minimizers as the standard library to compute random minimizers.
Future work.
We chose to use the data-independent two-stacks method as the core of our algorithm, that returns the minimizer position of every window and then requires deduplication. Given the promising performance of rescan for large in Table 2, an interesting alternative could be to go the opposite way and speed up the rescan step. This is particularly relevant when minimizers are sparse, as the number of branches may be small enough for linear scans to benefit from SIMD. Accelerating linear scans could also be useful for smaller inputs such as short reads, where splitting into 8 chunks may not be very efficient.
Another approach would be to use 512-bit AVX512 instructions and process 16 lanes in parallel. In theory that could be another faster, but in practice the collecting and deduplicating of values may become an even larger bottleneck. In simple experiments, without further profiling and optimizing the code, it is in fact slower.
References
- [1] Gaëtan Benoit, Sébastien Raguideau, Robert James, Adam M Phillippy, Rayan Chikhi, and Christopher Quince. High-quality metagenome assembly from long accurate reads with metamdbg. Nature Biotechnology, pages 1–6, 2024. doi:10.1038/s41587-023-01983-6.
- [2] Jonathan D. Cohen. Recursive hashing functions for n-grams. ACM Trans. Inf. Syst., 15(3):291–320, July 1997. doi:10.1145/256163.256168.
- [3] Sebastian Deorowicz, Marek Kokot, Szymon Grabowski, and Agnieszka Debudaj-Grabysz. KMC 2: fast and resource-frugal k-mer counting. Bioinformatics, 31(10):1569–1576, January 2015. doi:10.1093/bioinformatics/btv022.
- [4] Barış Ekim, Bonnie Berger, and Rayan Chikhi. Minimizer-space de bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer. Cell Systems, 12(10):958–968.e6, 2021. doi:10.1016/j.cels.2021.08.009.
- [5] Paolo Ferragina and Gonzalo Navarro. The Pizza&Chili Corpus. https://pizzachili.dcc.uchile.cl/texts.html, 2007.
- [6] Shay Golan, Ido Tziony, Matan Kraus, Yaron Orenstein, and Arseny Shur. GreedyMini: Generating low-density DNA minimizers. bioRxiv, November 2024. doi:10.1101/2024.10.28.620726.
- [7] Ragnar Groot Koerkamp, Daniel Liu, and Giulio Ermanno Pibiri. The open-closed mod-minimizer algorithm. Algorithms for Molecular Biology, 20(4), March 2025. doi:10.1186/s13015-025-00270-0.
- [8] Ragnar Groot Koerkamp and Igor Martayan. rust-seq/packed-seq. Software, swhId: swh:1:dir:181f8591389fdf9b941ac6f225a26d2784514787 (visited on 2025-06-30). URL: https://github.com/rust-seq/packed-seq, doi:10.4230/artifacts.23786.
- [9] Ragnar Groot Koerkamp and Igor Martayan. rust-seq/simd-minimizers. Software, swhId: swh:1:dir:d2fbef4c7750f02bc5cb084d0873078fef7f4e22 (visited on 2025-06-30). URL: https://github.com/rust-seq/simd-minimizers, doi:10.4230/artifacts.23785.
- [10] Ragnar Groot Koerkamp and Igor Martayan. SimdMinimizers: Computing random minimizers, fast. bioRxiv, 2025. doi:10.1101/2025.01.27.634998.
- [11] Ragnar Groot Koerkamp and Giulio Ermanno Pibiri. The mod-minimizer: A Simple and Efficient Sampling Algorithm for Long k-mers. In Solon P. Pissis and Wing-Kin Sung, editors, 24th International Workshop on Algorithms in Bioinformatics (WABI 2024), volume 312 of Leibniz International Proceedings in Informatics (LIPIcs), pages 11:1–11:23, Dagstuhl, Germany, 2024. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.WABI.2024.11.
- [12] Martin Hirzel, Scott Schneider, and Kanat Tangwongsan. Sliding-window aggregation algorithms: Tutorial. In Proceedings of the 11th ACM International Conference on Distributed and Event-based Systems, DEBS 2017, Barcelona, Spain, June 19-23, 2017, pages 11–14. ACM, 2017. doi:10.1145/3093742.3095107.
- [13] Guillaume Holley and Páll Melsted. Bifrost: highly parallel construction and indexing of colored and compacted de bruijn graphs. Genome biology, 21:1–20, 2020.
- [14] Chirag Jain, Arang Rhie, Nancy F Hansen, Sergey Koren, and Adam M Phillippy. Long-read mapping to repetitive reference sequences using winnowmap2. Nature Methods, 19(6):705–710, 2022. doi:10.1038/s41592-022-01457-8.
- [15] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249–260, 1987. doi:10.1147/rd.312.0249.
- [16] Parham Kazemi, Johnathan Wong, Vladimir Nikolić, Hamid Mohamadi, René L Warren, and Inanç Birol. nthash2: recursive spaced seed hashing for nucleotide sequences. Bioinformatics, 38(20):4812–4813, August 2022. doi:10.1093/bioinformatics/btac564.
- [17] Bryce Kille, Ragnar Groot Koerkamp, Drake McAdams, Alan Liu, and Todd J Treangen. A near-tight lower bound on the density of forward sampling schemes. Bioinformatics, 41(1):btae736, December 2024. doi:10.1093/bioinformatics/btae736.
- [18] Daniel Lemire. Removing duplicates from lists quickly. https://lemire.me/blog/2017/04/10/removing-duplicates-from-lists-quickly/, April 2017.
- [19] Heng Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094–3100, May 2018. doi:10.1093/bioinformatics/bty191.
- [20] Daniel Liu. minimizer.rs re-scan implementation. https://gist.github.com/Daniel-Liu-c0deb0t/7078ebca04569068f15507aa856be6e8, July 2023.
- [21] Camille Marchet, Mael Kerbiriou, and Antoine Limasset. Blight: efficient exact associative structure for k-mers. Bioinformatics, 37(18):2858–2865, April 2021. doi:10.1093/bioinformatics/btab217.
- [22] Igor Martayan. minimizer-iter: Iterate over minimizers of a DNA sequence. https://github.com/rust-seq/minimizer-iter.
- [23] Igor Martayan, Lucas Robidou, Yoshihiro Shibuya, and Antoine Limasset. Hyper-k-mers: Efficient streaming k-mers representation. In Research in Computational Molecular Biology, page 330–335. Springer Nature Switzerland, 2025. doi:10.1007/978-3-031-90252-9_33.
- [24] Guillaume Marçais, C S Elder, and Carl Kingsford. k-nonical space: sketching with reverse complements. Bioinformatics, 40(11):btae629, October 2024. doi:10.1093/bioinformatics/btae629.
- [25] Hamid Mohamadi, Justin Chu, Benjamin P. Vandervalk, and Inanc Birol. nthash: recursive nucleotide hashing. Bioinformatics, 32(22):3492–3494, July 2016. doi:10.1093/bioinformatics/btw397.
- [26] Sergey Nurk, Sergey Koren, Arang Rhie, Mikko Rautiainen, Andrey V. Bzikadze, Alla Mikheenko, Mitchell R. Vollger, Nicolas Altemose, Lev Uralsky, Ariel Gershman, Sergey Aganezov, Savannah J. Hoyt, Mark Diekhans, Glennis A. Logsdon, Michael Alonge, Stylianos E. Antonarakis, Matthew Borchers, Gerard G. Bouffard, Shelise Y. Brooks, Gina V. Caldas, Nae-Chyun Chen, Haoyu Cheng, Chen-Shan Chin, William Chow, Leonardo G. de Lima, Philip C. Dishuck, Richard Durbin, Tatiana Dvorkina, Ian T. Fiddes, Giulio Formenti, Robert S. Fulton, Arkarachai Fungtammasan, Erik Garrison, Patrick G. S. Grady, Tina A. Graves-Lindsay, Ira M. Hall, Nancy F. Hansen, Gabrielle A. Hartley, Marina Haukness, Kerstin Howe, Michael W. Hunkapiller, Chirag Jain, Miten Jain, Erich D. Jarvis, Peter Kerpedjiev, Melanie Kirsche, Mikhail Kolmogorov, Jonas Korlach, Milinn Kremitzki, Heng Li, Valerie V. Maduro, Tobias Marschall, Ann M. McCartney, Jennifer McDaniel, Danny E. Miller, James C. Mullikin, Eugene W. Myers, Nathan D. Olson, Benedict Paten, Paul Peluso, Pavel A. Pevzner, David Porubsky, Tamara Potapova, Evgeny I. Rogaev, Jeffrey A. Rosenfeld, Steven L. Salzberg, Valerie A. Schneider, Fritz J. Sedlazeck, Kishwar Shafin, Colin J. Shew, Alaina Shumate, Ying Sims, Arian F. A. Smit, Daniela C. Soto, Ivan Sović, Jessica M. Storer, Aaron Streets, Beth A. Sullivan, Françoise Thibaud-Nissen, James Torrance, Justin Wagner, Brian P. Walenz, Aaron Wenger, Jonathan M. D. Wood, Chunlin Xiao, Stephanie M. Yan, Alice C. Young, Samantha Zarate, Urvashi Surti, Rajiv C. McCoy, Megan Y. Dennis, Ivan A. Alexandrov, Jennifer L. Gerton, Rachel J. O’Neill, Winston Timp, Justin M. Zook, Michael C. Schatz, Evan E. Eichler, Karen H. Miga, and Adam M. Phillippy. The complete sequence of a human genome. Science, 376(6588):44–53, April 2022. doi:10.1126/science.abj6987.
- [27] Chenxu Pan and Knut Reinert. A simple refined dna minimizer operator enables 2-fold faster computation. Bioinformatics, 40(2):btae045, January 2024. doi:10.1093/bioinformatics/btae045.
- [28] David Pellow, Lianrong Pu, Bariş Ekim, Lior Kotlar, Bonnie Berger, Ron Shamir, and Yaron Orenstein. Efficient minimizer orders for large values of k using minimum decycling sets. Genome Research, 33(7):1154–1161, 2023. doi:10.1101/gr.277644.123.
- [29] Giulio Ermanno Pibiri. Sparse and skew hashing of k-mers. Bioinformatics, 38(Supplement_1):i185–i194, June 2022. doi:10.1093/bioinformatics/btac245.
- [30] Michael Roberts, Wayne Hayes, Brian R. Hunt, Stephen M. Mount, and James A. Yorke. Reducing storage requirements for biological sequence comparison. Bioinformatics, 20(18):3363–3369, July 2004. doi:10.1093/bioinformatics/bth408.
- [31] Saul Schleimer, Daniel S. Wilkerson, and Alex Aiken. Winnowing: local algorithms for document fingerprinting. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, SIGMOD ’03, pages 76–85, New York, NY, USA, June 2003. Association for Computing Machinery. doi:10.1145/872757.872770.
- [32] Georgios Theodorakis, Alexandros Koliousis, Peter R. Pietzuch, and Holger Pirk. Hammer slide: Work- and cpu-efficient streaming window aggregation. In Rajesh Bordawekar and Tirthankar Lahiri, editors, International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage Architectures, ADMS@VLDB 2018, Rio de Janeiro, Brazil, August 27, 2018, pages 34–41, 2018. URL: http://www.adms-conf.org/2018-camera-ready/SIMDWindowPaper_ADMS%2718.pdf.
- [33] Derrick E Wood, Jennifer Lu, and Ben Langmead. Improved metagenomic analysis with Kraken 2. Genome biology, 20:1–13, 2019. doi:10.1186/s13059-019-1891-0.
- [34] Wang Yi and Diego Barrios Romero. wyhash-rs, fast portable non-cryptographic hashing algorithm. https://github.com/eldruin/wyhash-rs.
- [35] Hongyu Zheng, Carl Kingsford, and Guillaume Marçais. Improved design and analysis of practical minimizers. Bioinformatics, 36(Supplement_1):i119–i127, July 2020. doi:10.1093/bioinformatics/btaa472.
Appendix A Results for NEON Architecture
The NEON experiments are run on a performance core on an Apple M1 chip with 4 efficiency and 4 performance cores. The performance core runs at 3.2GHz and has 128KiB of L1 cache, 12MiB of shared L2 cache (for the performance cores) and 8MiB of shared L3 cache (for the whole system).
| Part | ns/bp |
|---|---|
| Gather and sum all bases | 0.10 |
| + collect to vector | 0.18 |
| + collect the delayed bases | 0.19 |
| + ntHash | 0.33 |
| + sliding window min | 0.82 |
| + collect | 0.95 |
| + dedup | 1.38 |
| + canonical nthash | 0.85 |
| + canonical strand | 1.16 |
| + collect | 1.37 |
| + dedup | 1.82 |
| : | ||||||
|---|---|---|---|---|---|---|
| Method | fwd. | cano. | fwd. | cano. | fwd. | cano. |
| minimizer-iter | 12.89 | 16.60 | 14.26 | 18.08 | 14.20 | 18.04 |
| Rescan ntHash | 5.95 | – | 3.92 | – | 2.82 | – |
| simd-minimizers ntHash | ||||||
| – packed input | 1.42 | 1.85 | 1.38 | 1.82 | 1.38 | 1.83 |
| – on-the-fly packing | 1.66 | 2.07 | 1.58 | 1.99 | 1.59 | 2.02 |
| Rescan mulHash | 6.94 | – | 3.93 | – | 3.42 | – |
| simd-minimizers mulHash | ||||||
| – packed input | 1.97 | 4.34 | 1.91 | 4.27 | 1.93 | 4.26 |
| – on-the-fly packing | 2.11 | 4.37 | 2.04 | 4.41 | 2.03 | 4.30 |
| – ASCII input | 2.12 | 4.39 | 2.06 | 4.44 | 2.04 | 4.29 |
