Accurate k-mer Classification Using Read Profiles

Authors Yoshihiko Suzuki, Gene Myers



Author Details

Yoshihiko Suzuki
  • Okinawa Institute of Science and Technology Graduate University, Okinawa, Japan
Gene Myers
  • Okinawa Institute of Science and Technology Graduate University, Okinawa, Japan
  • Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
  • Center for Systems Biology Dresden, Dresden, Germany


We wish to thank Shinichi Morishita, Yuta Suzuki, Bansho Masutani, Ryo Nakabayashi, Charles Plessy, and Michael Mansfield for their feedback and stimulating works. We also thank the Scientific Computing and Data Analysis section of Research Support Division and Communication and Public Relations Division at OIST for providing HPC resources and for proofreading of the manuscript, respectively.

Cite As

Yoshihiko Suzuki and Gene Myers. Accurate k-mer Classification Using Read Profiles. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 10:1-10:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)


Contiguous strings of length k, called k-mers, are a fundamental element in many bioinformatics tasks. The number of occurrences of a k-mer in a given set of DNA sequencing reads, its k-mer count, has often been used to roughly estimate the copy number of a k-mer in the genome from which the reads were sampled. The problem of estimating copy numbers, called here the k-mer classification problem, has been based on simply analyzing the histogram of counts of all the k-mers in a data set, thus ignoring the positional context and dependency between multiple k-mers that appear nearby in the underlying genome. Here we present an efficient and significantly more accurate method for classifying k-mers by analyzing the sequence of k-mer counts along each sequencing read, called a read profile. By analyzing read profiles, we explicitly incorporate into the model the dependencies between the positionally adjacent k-mers and the sequence context-dependent error rates estimated from the given dataset. For long sequencing reads produced with the accurate high-fidelity (HiFi) sequencing technology, an implementation of our method, ClassPro, outperforms the conventional, histogram-based method in every simulation dataset of fruit fly and human with various realistic values of sequencing coverage and heterozygosity. Within only a few minutes, ClassPro achieves an average accuracy of > 99.99% across reads without repetitive k-mers and > 99.5% across all reads, in a typical fruit fly simulation data set with a 40× coverage. The resulting, more accurate k-mer classifications by ClassPro are in principle expected to improve any k-mer-based downstream analyses for sequenced reads such as read mapping and overlap, spectral alignment and error correction, haplotype phasing, and trio binning to name but a few. ClassPro is available at
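To make the notion of a read profile concrete, here is a minimal Python sketch. It is not ClassPro's implementation: the function names `count_kmers` and `read_profile`, the toy reads, and the choice of k = 4 are assumptions made purely for illustration, and only the forward strand is counted (no canonical k-mers). It shows only how the sequence of k-mer counts along one read is obtained from the counts over the whole read set.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (forward strand only) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def read_profile(read, counts, k):
    """Return the k-mer count at each k-mer start position along one read."""
    return [counts[read[i:i + k]] for i in range(len(read) - k + 1)]


# Toy data (hypothetical): the third read differs from the first by one base.
reads = ["ACGTACGT", "CGTACGTA", "ACGTTCGT"]
k = 4

counts = count_kmers(reads, k)
profile = read_profile(reads[2], counts, k)
print(profile)  # [4, 1, 1, 1, 1]
```

In this toy input, the k-mers overlapping the differing base in the third read each occur only once, so the profile drops from 4 to a run of 1s. A histogram-based approach would look at each of these counts in isolation, whereas a profile keeps their order and adjacency along the read.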

Subject Classification

ACM Subject Classification
  • Applied computing → Molecular sequence analysis

Keywords
  • K-mer
  • K-mer count
  • K-mer classification
  • HiFi sequencing



