Sven Rahmann
Creative Commons Attribution 4.0 International license
Motivation. Short DNA sequences of length k that appear in a single location (e.g., at a single genomic position or in a single species from a larger set of species) are called unique k-mers. They are useful for placing sequenced DNA fragments at the correct location without computing alignments and without ambiguity. However, they are not necessarily robust: a single base-pair change may turn a unique k-mer into a different one that is in fact present at one or more other locations, which may give confusing or contradictory information when attempting to place a read by its k-mer content. A more robust concept is that of strongly unique k-mers, i.e., unique k-mers for which no Hamming-distance-1 neighbor with conflicting information exists in any of the considered sequences. Given a set of k-mers, it is therefore of interest to have an efficient method that can distinguish k-mers that have a Hamming-distance-1 neighbor in the collection from those that do not.
Results. We present engineered algorithms to identify and mark within a set K of (canonical) k-mers all elements that have a Hamming-distance-1 neighbor in the same set. One algorithm is based on recursively running a 4-way comparison on sub-intervals of the sorted set. The other algorithm is based on bucketing and running a pairwise bit-parallel Hamming distance test on small buckets of the sorted set. Both methods consider canonical k-mers (i.e., they take reverse complements into account) and allow for efficient parallelization. The methods have been implemented and applied in practice to sets consisting of several billion k-mers. An optimized combined approach running with 16 threads on a 16-core workstation yields wall-clock running times below 20 seconds on the 2.5 billion distinct 31-mers of the human telomere-to-telomere reference genome.
Availability. An implementation can be found at https://gitlab.com/rahmannlab/strong-k-mers.
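To make the bucketing idea from the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes a 2-bit encoding (A=0, C=1, G=2, T=3), ignores reverse complements for brevity, and uses hypothetical helper names (`encode`, `hamming_distance`, `mark_with_neighbor`). It relies on the observation that two k-mers at Hamming distance 1 must agree on either their first or their second half, so only k-mers sharing a half need to be compared pairwise.

```python
from collections import defaultdict

# Assumed 2-bit encoding; the paper's actual encoding may differ.
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer):
    """Pack a DNA string into an integer, 2 bits per base."""
    code = 0
    for base in kmer:
        code = (code << 2) | ENC[base]
    return code


def hamming_distance(a, b, k):
    """Bit-parallel count of differing bases between two encoded k-mers."""
    x = a ^ b
    low_mask = int("01" * k, 2)  # selects the low bit of every 2-bit pair
    # A base differs if either bit of its pair differs; fold the high bit
    # of each pair onto the low bit, then count the set low bits.
    diff = (x | (x >> 1)) & low_mask
    return bin(diff).count("1")


def mark_with_neighbor(codes, k):
    """Return all encoded k-mers that have a Hamming-distance-1 neighbor
    in the same collection (reverse complements are NOT considered)."""
    low_bits = 2 * (k - k // 2)  # bits occupied by the second half
    low_mask = (1 << low_bits) - 1
    keys = (
        lambda c: c >> low_bits,  # bucket by first half
        lambda c: c & low_mask,   # bucket by second half
    )
    marked = set()
    for key in keys:
        buckets = defaultdict(list)
        for c in codes:
            buckets[key(c)].append(c)
        # Pairwise test inside each (hopefully small) bucket.
        for bucket in buckets.values():
            for i, a in enumerate(bucket):
                for b in bucket[i + 1:]:
                    if hamming_distance(a, b, k) == 1:
                        marked.add(a)
                        marked.add(b)
    return marked


kmers = ["ACGTA", "ACGTC", "CCGTA", "TTTTT", "GGGCC"]
codes = [encode(s) for s in kmers]
weak = mark_with_neighbor(codes, 5)
```

In this toy input, `ACGTA` differs from `ACGTC` and from `CCGTA` in one base each, so those three k-mers are marked, while `TTTTT` and `GGGCC` are not. A production version would additionally canonicalize k-mers and work on a sorted array rather than on Python dictionaries.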
@InProceedings{zentgraf_et_al:LIPIcs.WABI.2024.15,
author = {Zentgraf, Jens and Rahmann, Sven},
title = {{Swiftly Identifying Strongly Unique k-Mers}},
booktitle = {24th International Workshop on Algorithms in Bioinformatics (WABI 2024)},
pages = {15:1--15:15},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-340-9},
ISSN = {1868-8969},
year = {2024},
volume = {312},
editor = {Pissis, Solon P. and Sung, Wing-Kin},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2024.15},
URN = {urn:nbn:de:0030-drops-206593},
doi = {10.4230/LIPIcs.WABI.2024.15},
annote = {Keywords: k-mer, Hamming distance, strong uniqueness, parallelization, algorithm engineering}
}