Fast Gapped k-mer Counting with Subdivided Multi-Way Bucketed Cuckoo Hash Tables

Authors Jens Zentgraf , Sven Rahmann



PDF
Thumbnail PDF

File

LIPIcs.WABI.2022.12.pdf
  • Filesize: 0.87 MB
  • 20 pages

Document Identifiers

Author Details

Jens Zentgraf
  • Department of Computer Science, Saarland University, Saarbrücken, Germany
  • Center for Bioinformatics, Saarland University, Saarbrücken, Germany
  • Saarbrücken Graduate School of Computer Science, Saarland University, Saarbrücken, Germany
Sven Rahmann
  • Department of Computer Science, Saarland University, Saarbrücken, Germany
  • Center for Bioinformatics, Saarland University, Saarbrücken, Germany

Acknowledgements

We thank Rajesh Moturu for running preliminary experiments with gapped k-mers.

Cite AsGet BibTex

Jens Zentgraf and Sven Rahmann. Fast Gapped k-mer Counting with Subdivided Multi-Way Bucketed Cuckoo Hash Tables. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 12:1-12:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.WABI.2022.12

Abstract

Motivation. In biological sequence analysis, alignment-free (also known as k-mer-based) methods are increasingly replacing mapping- and alignment-based methods for various applications. A basic step of such methods consists of building a table of all k-mers of a given set of sequences (a reference genome or a dataset of sequenced reads) and their counts. Over the past years, efficient methods and tools for k-mer counting have been developed. In a different line of work, the use of gapped k-mers has been shown to offer advantages over the use of the standard contiguous k-mers. However, no tool seems to be available that is able to count gapped k-mers with the same efficiency as contiguous k-mers. One reason is that the most efficient k-mer counters use minimizers (of a length m < k) to group k-mers into buckets, such that many consecutive k-mers are classified into the same bucket. This approach leads to cache-friendly (and hence extremely fast) algorithms, but the approach does not transfer easily to gapped k-mers. Consequently, the existing efficient k-mer counters cannot be trivially modified to count gapped k-mers with the same efficiency. Results. We present a different approach that is equally applicable to contiguous k-mers and gapped k-mers. We use multi-way bucketed Cuckoo hash tables to efficiently store (gapped) k-mers and their counts. We also describe a method to parallelize counting over multiple threads without using locks: We subdivide the hash table into independent subtables, and use a producer-consumer model, such that each thread serves one subtable. This requires designing Cuckoo hash functions with the property that all alternative locations for each k-mer are located in the same subtable. Compared to some of the fastest contiguous k-mer counters, our approach is of comparable speed, or even faster, on large datasets, and it is the only one that supports gapped k-mers.

Subject Classification

ACM Subject Classification
  • Applied computing → Molecular sequence analysis
  • Applied computing → Bioinformatics
  • Theory of computation → Bloom filters and hashing
  • Theory of computation → Data structures design and analysis
Keywords
  • gapped k-mer
  • k-mer
  • counting
  • Cuckoo hashing
  • parallelization

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. A. Bankevich, A. V. Bzikadze, M. Kolmogorov, D. Antipov, and P. A. Pevzner. Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads. Nat Biotechnol, February 2022. Google Scholar
  2. N. L. Bray, H. Pimentel, P. Melsted, and L. Pachter. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol., 34(5):525-527, May 2016. Erratum in Nat. Biotechnol. 34(8):888 (2016). Google Scholar
  3. Karel Břinda, Maciej Sykulski, and Gregory Kucherov. Spaced seeds improve k-mer-based metagenomic classification. Bioinformatics, 31(22):3584-3592, July 2015. URL: https://doi.org/10.1093/bioinformatics/btv419.
  4. Andrea Califano and Isidore Rigoutsos. FLASH: a fast look-up algorithm for string homology. In Lawrence Hunter, David B. Searls, and Jude W. Shavlik, editors, Proceedings of the 1st International Conference on Intelligent Systems for Molecular Biology, Bethesda, MD, USA, July 1993, pages 56-64. AAAI, 1993. URL: http://www.aaai.org/Library/ISMB/1993/ismb93-007.php.
  5. S. Deorowicz, M. Kokot, S. Grabowski, and A. Debudaj-Grabysz. KMC 2: fast and resource-frugal k-mer counting. Bioinformatics, 31(10):1569-1576, May 2015. Google Scholar
  6. M. Erbert, S. Rechner, and M. Müller-Hannemann. Gerbil: a fast and memory-efficient k-mer counter with GPU-support. Algorithms Mol Biol, 12:9, 2017. Google Scholar
  7. Lars Hahn, Chris-André Leimeister, Rachid Ounit, Stefano Lonardi, and Burkhard Morgenstern. rasbhari: Optimizing spaced seeds for database searching, read mapping and alignment-free sequence comparison. PLoS Comput. Biol., 12(10), 2016. URL: https://doi.org/10.1371/journal.pcbi.1005107.
  8. M. Kokot, M. Dlugosz, and S. Deorowicz. KMC 3: counting and manipulating k-mer statistics. Bioinformatics, 33(17):2759-2761, September 2017. Google Scholar
  9. Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: a LLVM-based python JIT compiler. In Hal Finkel, editor, Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, LLVM 2015, pages 7:1-7:6. ACM, 2015. URL: https://doi.org/10.1145/2833157.2833162.
  10. J. Lu, F. P. Breitwieser, P. Thielen, and S. L. Salzberg. Bracken: estimating species abundance in metagenomics data. PeerJ Computer Science, 3:e104, 2017. URL: https://doi.org/10.7717/peerj-cs.104.
  11. Bin Ma, John Tromp, and Ming Li. Patternhunter: faster and more sensitive homology search. Bioinform., 18(3):440-445, 2002. URL: https://doi.org/10.1093/bioinformatics/18.3.440.
  12. Swati C Manekar and Shailesh R Sathe. A benchmark study of k-mer counting methods for high-throughput sequencing. GigaScience, 7(12), October 2018. giy125. URL: https://doi.org/10.1093/gigascience/giy125.
  13. Guillaume Marçais and Carl Kingsford. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27(6):764-770, January 2011. URL: https://doi.org/10.1093/bioinformatics/btr011.
  14. H. Mohamadi, J. Chu, B. P. Vandervalk, and I. Birol. ntHash: recursive nucleotide hashing. Bioinformatics, 32(22):3492-3494, November 2016. Google Scholar
  15. H. Mohamadi, H. Khan, and I. Birol. ntCard: a streaming algorithm for cardinality estimation in genomics data. Bioinformatics, 33(9):1324-1330, May 2017. Google Scholar
  16. A. Müller, C. Hundt, A. Hildebrandt, T. Hankeln, and B. Schmidt. MetaCache: context-aware classification of metagenomic reads using minhashing. Bioinformatics, 33(23):3740-3748, December 2017. Google Scholar
  17. L. Noé. Best hits of 11110110111: model-free selection and parameter-free sensitivity calculation of spaced seeds. Algorithms Mol Biol, 12:1, 2017. Google Scholar
  18. Laurent Noé. Best hits of 11110110111: model-free selection and parameter-free sensitivity calculation of spaced seeds. Algorithms Mol. Biol., 12(1):1:1-1:16, 2017. URL: https://doi.org/10.1186/s13015-017-0092-1.
  19. S. Nurk, S. Koren, A. Rhie, M. Rautiainen, A. V. Bzikadze, A. Mikheenko, M. R. Vollger, N. Altemose, L. Uralsky, A. Gershman, S. Aganezov, S. J. Hoyt, M. Diekhans, G. A. Logsdon, M. Alonge, S. E. Antonarakis, M. Borchers, G. G. Bouffard, S. Y. Brooks, G. V. Caldas, N. C. Chen, H. Cheng, C. S. Chin, W. Chow, L. G. de Lima, P. C. Dishuck, R. Durbin, T. Dvorkina, I. T. Fiddes, G. Formenti, R. S. Fulton, A. Fungtammasan, E. Garrison, P. G. S. Grady, T. A. Graves-Lindsay, I. M. Hall, N. F. Hansen, G. A. Hartley, M. Haukness, K. Howe, M. W. Hunkapiller, C. Jain, M. Jain, E. D. Jarvis, P. Kerpedjiev, M. Kirsche, M. Kolmogorov, J. Korlach, M. Kremitzki, H. Li, V. V. Maduro, T. Marschall, A. M. McCartney, J. McDaniel, D. E. Miller, J. C. Mullikin, E. W. Myers, N. D. Olson, B. Paten, P. Peluso, P. A. Pevzner, D. Porubsky, T. Potapova, E. I. Rogaev, J. A. Rosenfeld, S. L. Salzberg, V. A. Schneider, F. J. Sedlazeck, K. Shafin, C. J. Shew, A. Shumate, Y. Sims, A. F. A. Smit, D. C. Soto, I. Sović, J. M. Storer, A. Streets, B. A. Sullivan, F. Thibaud-Nissen, J. Torrance, J. Wagner, B. P. Walenz, A. Wenger, J. M. D. Wood, C. Xiao, S. M. Yan, A. C. Young, S. Zarate, U. Surti, R. C. McCoy, M. Y. Dennis, I. A. Alexandrov, J. L. Gerton, R. J. O'Neill, W. Timp, J. M. Zook, M. C. Schatz, E. E. Eichler, K. H. Miga, and A. M. Phillippy. The complete sequence of a human genome. Science, 376(6588):44-53, April 2022. Google Scholar
  20. C. Reimer, C. J. Rubin, A. R. Sharifi, N. T. Ha, S. Weigend, K. H. Waldmann, O. Distl, S. D. Pant, M. Fredholm, M. Schlather, and H. Simianer. Analysis of porcine body size variation using re-sequencing data of miniature and large pigs. BMC Genomics, 19(1):687, September 2018. Google Scholar
  21. Z. Sun, S. Huang, M. Zhang, Q. Zhu, N. Haiminen, A. P. Carrieri, Y. Vázquez-Baeza, L. Parida, H. C. Kim, R. Knight, and Y. Y. Liu. Challenges in benchmarking metagenomic profilers. Nat Methods, 18(6):618-626, June 2021. Google Scholar
  22. Stefan Walzer. Load thresholds for cuckoo hashing with overlapping blocks. In Ioannis Chatzigiannakis, Christos Kaklamanis, Dániel Marx, and Donald Sannella, editors, 45th International Colloquium on Automata, Languages, and Programming, ICALP 2018, volume 107 of LIPIcs, pages 102:1-102:10. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2018. URL: https://doi.org/10.4230/LIPIcs.ICALP.2018.102.
  23. Jens Zentgraf and Sven Rahmann. Fast lightweight accurate xenograft sorting. In Carl Kingsford and Nadia Pisanti, editors, 20th International Workshop on Algorithms in Bioinformatics, WABI 2020, September 7-9, 2020, Pisa, Italy (Virtual Conference), volume 172 of LIPIcs, pages 4:1-4:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020. URL: https://doi.org/10.4230/LIPIcs.WABI.2020.4.
  24. Jens Zentgraf and Sven Rahmann. Fast lightweight accurate xenograft sorting. Algorithms Mol. Biol., 16(1):2, 2021. URL: https://doi.org/10.1186/s13015-021-00181-w.
  25. Jens Zentgraf, Henning Timm, and Sven Rahmann. Cost-optimal assignment of elements in genome-scale multi-way bucketed cuckoo hash tables. In Guy E. Blelloch and Irene Finocchi, editors, Proceedings of the Symposium on Algorithm Engineering and Experiments, ALENEX 2020, Salt Lake City, UT, USA, January 5-6, 2020, pages 186-198. SIAM, 2020. URL: https://doi.org/10.1137/1.9781611976007.15.
  26. J. M. Zook, D. Catoe, J. McDaniel, L. Vang, N. Spies, A. Sidow, Z. Weng, Y. Liu, C. E. Mason, N. Alexander, E. Henaff, A. B. McIntyre, D. Chandramohan, F. Chen, E. Jaeger, A. Moshrefi, K. Pham, W. Stedman, T. Liang, M. Saghbini, Z. Dzakula, A. Hastie, H. Cao, G. Deikus, E. Schadt, R. Sebra, A. Bashir, R. M. Truty, C. C. Chang, N. Gulbahce, K. Zhao, S. Ghosh, F. Hyland, Y. Fu, M. Chaisson, C. Xiao, J. Trow, S. T. Sherry, A. W. Zaranek, M. Ball, J. Bobe, P. Estep, G. M. Church, P. Marks, S. Kyriazopoulou-Panagiotopoulou, G. X. Zheng, M. Schnall-Levin, H. S. Ordonez, P. A. Mudivarti, K. Giorda, Y. Sheng, K. B. Rypdal, and M. Salit. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data, 3:160025, June 2016. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail