Accurate k-mer Classification Using Read Profiles

Suzuki, Yoshihiko; Myers, Gene

doi:10.4230/LIPIcs.WABI.2022.10

Abstract

Contiguous strings of length k, called k-mers, are a fundamental element in many bioinformatics tasks. The number of occurrences of a k-mer in a given set of DNA sequencing reads, its k-mer count, has often been used to roughly estimate the copy number of the k-mer in the genome from which the reads were sampled. The problem of estimating copy numbers, called here the k-mer classification problem, has so far been approached by simply analyzing the histogram of counts of all the k-mers in a dataset, thus ignoring the positional context and the dependency between k-mers that appear nearby in the underlying genome. Here we present an efficient and significantly more accurate method for classifying k-mers by analyzing the sequence of k-mer counts along each sequencing read, called a read profile. By analyzing read profiles, we explicitly incorporate into the model the dependencies between positionally adjacent k-mers and the sequence context-dependent error rates estimated from the given dataset. For long sequencing reads produced with the accurate high-fidelity (HiFi) sequencing technology, an implementation of our method, ClassPro, outperforms the conventional, histogram-based method on every simulated fruit fly and human dataset, across various realistic values of sequencing coverage and heterozygosity. Within only a few minutes, ClassPro achieves an average accuracy of > 99.99% across reads without repetitive k-mers and > 99.5% across all reads on a typical simulated fruit fly dataset with 40× coverage. The resulting, more accurate k-mer classifications by ClassPro are in principle expected to improve any k-mer-based downstream analysis of sequenced reads, such as read mapping and overlap, spectral alignment and error correction, haplotype phasing, and trio binning, to name but a few. ClassPro is available at https://github.com/yoshihikosuzuki/ClassPro.
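
To make the two objects contrasted in the abstract concrete, the following is a minimal Python sketch of a k-mer count histogram and a read profile. It is an illustration only, not ClassPro's implementation: the function names (`kmers`, `count_kmers`, `read_profile`), the toy reads, and the choice of counting k-mers on the forward strand only (no canonical k-mers) are all assumptions made for this example.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every contiguous substring of length k in seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(reads, k):
    """Count occurrences of each k-mer over all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def read_profile(read, counts, k):
    """Return the sequence of k-mer counts along one read (its read profile)."""
    return [counts[km] for km in kmers(read, k)]


# Toy data, hypothetical and far too small to be realistic.
reads = ["ACGTAC", "CGTACG", "ACGTTT"]
k = 3

counts = count_kmers(reads, k)

# Histogram view: how many distinct k-mers have each count.
# Positional information is lost here.
histogram = Counter(counts.values())

# Profile view: counts in the order the k-mers occur along a read,
# so drops and rises between adjacent positions remain visible.
profile_first = read_profile(reads[0], counts, k)
profile_last = read_profile(reads[2], counts, k)
```

In this toy example the last read's profile drops from 3 to 1 partway along the read, a positional change that the histogram alone cannot localize. A real pipeline would use a dedicated k-mer counter rather than an in-memory `Counter`.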

