Exact Sketch-Based Read Mapping

Authors: Tizian Schulz, Paul Medvedev



Author Details

Tizian Schulz
  • Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Germany
  • Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Bielefeld University, Germany
  • Graduate School "Digital Infrastructure for the Life Sciences" (DILS), Bielefeld University, Germany
Paul Medvedev
  • Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA
  • Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, USA
  • Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA

Acknowledgements

We would like to thank K. Sahlin for helpful early feedback.

Cite As

Tizian Schulz and Paul Medvedev. Exact Sketch-Based Read Mapping. In 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 14:1-14:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.WABI.2023.14

Abstract

Given a sequencing read, the broad goal of read mapping is to find the location(s) in the reference genome that have a "similar sequence". Traditionally, "similar sequence" was defined as having a high alignment score, and read mappers were viewed as heuristic solutions to this well-defined problem. For sketch-based mappers, however, there has been no formulation of the problem that an exact sketch-based mapping algorithm should solve. Moreover, there is no sketch-based method that can find all possible mapping positions for a read above a certain score threshold. In this paper, we formulate the problem of read mapping at the level of sequence sketches. We give an exact dynamic programming algorithm that finds all hits above a given similarity threshold. It runs in O(|t| + |p| + ℓ²) time and Θ(ℓ²) space, where |t| is the number of k-mers inside the sketch of the reference, |p| is the number of k-mers inside the read’s sketch, and ℓ is the number of times that k-mers from the pattern sketch occur in the sketch of the text. We evaluate our algorithm’s performance in mapping long reads to the T2T assembly of human chromosome Y, where ampliconic regions make it desirable to find all good mapping positions. At a level of precision equivalent to that of minimap2, the recall of our algorithm is 0.88, compared to only 0.76 for minimap2.
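To make the quantities in the running-time bound concrete, the following minimal Python sketch computes |t|, |p|, and ℓ for a toy text and pattern. It is an illustration only, not the paper's algorithm: the sampling rule (keep a k-mer when its CRC32 hash is divisible by s), the function names kmer_sketch and count_occurrences, and the reading of ℓ as the number of text-sketch entries whose k-mer also appears in the pattern sketch are all assumptions made for this example.

```python
import zlib


def kmer_sketch(seq, k, s):
    """Return (position, k-mer) pairs kept by a hypothetical sampling rule.

    A k-mer is kept when its CRC32 hash is divisible by s (an assumed rule,
    chosen only for illustration). With s = 1 every k-mer is kept.
    """
    sketch = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if zlib.crc32(kmer.encode()) % s == 0:
            sketch.append((i, kmer))
    return sketch


def count_occurrences(text_sketch, pattern_sketch):
    """Count text-sketch entries whose k-mer also occurs in the pattern sketch.

    This is one reading of the quantity called l in the abstract.
    """
    pattern_kmers = {kmer for _, kmer in pattern_sketch}
    return sum(1 for _, kmer in text_sketch if kmer in pattern_kmers)


# Toy input; s = 1 keeps all k-mers so the counts are easy to check by hand.
text_sketch = kmer_sketch("ACGTACGT", k=3, s=1)
pattern_sketch = kmer_sketch("ACGT", k=3, s=1)
t_size = len(text_sketch)        # |t|
p_size = len(pattern_sketch)     # |p|
ell = count_occurrences(text_sketch, pattern_sketch)
print(t_size, p_size, ell)
```

In this toy case the text sketch holds six k-mers and the pattern sketch two, and the pattern's k-mers ACG and CGT each occur twice in the text sketch. Because the stated bound is quadratic in ℓ but only linear in |t| and |p|, repetitive regions that inflate ℓ would dominate the cost under this reading.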

Subject Classification

ACM Subject Classification
  • Applied computing → Computational biology
Keywords
  • Sequence Sketching
  • Long-read Mapping
  • Exact Algorithm
  • Dynamic Programming
