Swiftly Identifying Strongly Unique k-Mers

Authors: Jens Zentgraf, Sven Rahmann



Author Details

Jens Zentgraf
  • Algorithmic Bioinformatics, Department of Computer Science, Saarland University, Saarbrücken, Germany
  • Center for Bioinformatics Saar, Saarland Informatics Campus, Saarbrücken, Germany
  • Graduate School of Computer Science, Saarland Informatics Campus, Saarbrücken, Germany
Sven Rahmann
  • Algorithmic Bioinformatics, Department of Computer Science, Saarland University, Saarbrücken, Germany
  • Center for Bioinformatics Saar, Saarland Informatics Campus, Saarbrücken, Germany

Acknowledgements

We thank Karl Bringmann for discussions about the FourWay algorithm. We also thank the four reviewers of the initial version of this manuscript for their suggestions to improve its presentation.

Cite As

Jens Zentgraf and Sven Rahmann. Swiftly Identifying Strongly Unique k-Mers. In 24th International Workshop on Algorithms in Bioinformatics (WABI 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 312, pp. 15:1-15:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.WABI.2024.15

Abstract

Motivation. Short DNA sequences of length k that appear in a single location (e.g., at a single genomic position, in a single species from a larger set of species, etc.) are called unique k-mers. They are useful for placing sequenced DNA fragments at the correct location without computing alignments and without ambiguity. However, they are not necessarily robust: a single basepair change may turn a unique k-mer into a different one that may in fact be present at one or more different locations, which may give confusing or contradictory information when attempting to place a read by its k-mer content. A more robust concept is that of strongly unique k-mers, i.e., unique k-mers for which no Hamming-distance-1 neighbor with conflicting information exists in any of the considered sequences. Given a set of k-mers, it is therefore of interest to have an efficient method that can distinguish k-mers that have a Hamming-distance-1 neighbor in the collection from those that do not.

Results. We present engineered algorithms to identify and mark within a set K of (canonical) k-mers all elements that have a Hamming-distance-1 neighbor in the same set. One algorithm is based on recursively running a 4-way comparison on sub-intervals of the sorted set. The other algorithm is based on bucketing and running a pairwise bit-parallel Hamming distance test on small buckets of the sorted set. Both methods consider canonical k-mers (i.e., taking reverse complements into account) and allow for efficient parallelization. The methods have been implemented and applied in practice to sets consisting of several billions of k-mers. An optimized combined approach running with 16 threads on a 16-core workstation yields wall-clock running times below 20 seconds on the 2.5 billion distinct 31-mers of the human telomere-to-telomere reference genome.

Availability. An implementation can be found at https://gitlab.com/rahmannlab/strong-k-mers.
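The bucketing idea and the bit-parallel Hamming distance test mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' engineered implementation. It assumes a 2-bit encoding (A=0, C=1, G=2, T=3), takes the numerically smaller of a k-mer and its reverse complement as the canonical form, and buckets by each half of the k-mer in turn. All function names are hypothetical, and the sketch compares distinct set elements only.

```python
from collections import defaultdict
from itertools import combinations

# Assumed 2-bit encoding; with it, the complement of a base b is 3 - b.
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer):
    """Pack a DNA string into an integer, 2 bits per base."""
    code = 0
    for base in kmer:
        code = (code << 2) | ENC[base]
    return code


def revcomp(code, k):
    """Reverse complement of a 2-bit encoded k-mer."""
    rc = 0
    for _ in range(k):
        rc = (rc << 2) | (3 - (code & 3))
        code >>= 2
    return rc


def canonical(code, k):
    """Assumed canonical form: the smaller of the k-mer and its reverse complement."""
    return min(code, revcomp(code, k))


def is_hamming1(x, y, k):
    """Bit-parallel test: do x and y differ in exactly one base?"""
    diff = x ^ y
    low_bits = int("01" * k, 2)
    # Collapse each 2-bit group to one bit that is set iff the bases differ.
    mism = (diff | (diff >> 1)) & low_bits
    # Exactly one mismatch <=> mism is a power of two.
    return mism != 0 and (mism & (mism - 1)) == 0


def mark_weak_bruteforce(kmers, k):
    """Reference method: test all pairs of distinct canonical k-mers."""
    weak = set()
    for x, y in combinations(kmers, 2):
        if is_hamming1(x, y, k) or is_hamming1(x, revcomp(y, k), k):
            weak.update((x, y))
    return weak


def mark_weak_bucketed(kmers, k):
    """Bucket by each half of the k-mer; a Hamming-distance-1 pair differs
    in only one half, so it shares a bucket under the other half's key."""
    low_len = k // 2
    low_mask = (1 << (2 * low_len)) - 1
    # Both orientations of every k-mer, each tagged with its canonical code.
    oriented = [(o, c) for c in kmers for o in {c, revcomp(c, k)}]
    weak = set()
    for key in (lambda o: o >> (2 * low_len), lambda o: o & low_mask):
        buckets = defaultdict(list)
        for o, c in oriented:
            buckets[key(o)].append((o, c))
        for bucket in buckets.values():
            for (o1, c1), (o2, c2) in combinations(bucket, 2):
                if c1 != c2 and is_hamming1(o1, o2, k):
                    weak.update((c1, c2))
    return weak
```

Both marking functions expect a set of distinct canonical codes, for example `{canonical(encode(s), 5) for s in ["ACGTA", "ACGTC", "TTTTT"]}`, and return the subset that has a Hamming-distance-1 neighbor in that set. Inserting both orientations doubles the number of compared items but keeps the sketch short; an engineered implementation would avoid that overhead and work on sorted arrays rather than dictionaries.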

Subject Classification

ACM Subject Classification
  • Applied computing → Molecular sequence analysis
  • Applied computing → Bioinformatics
  • Theory of computation → Parallel algorithms
  • Theory of computation → Sorting and searching
  • Information systems → Nearest-neighbor search
Keywords
  • k-mer
  • Hamming distance
  • strong uniqueness
  • parallelization
  • algorithm engineering
