ACM Other Conferences

10.1145/acmotherconferences

0000000

10.5555/0000000

Proceedings of the 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)

WABI 2023

10.4230/LIPIcs.WABI.2023.14

10010405.10010444.10010087

Applied computing~Computational biology

500

Exact Sketch-Based Read Mapping

https://orcid.org/0000-0003-0744-7078

Schulz

Tizian

Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Germany

tizian.schulz@uni-bielefeld.de

Author

Medvedev

Paul

Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA

pzm11@psu.edu

Author

29 08 2023

14:1 14:19

Given a sequencing read, the broad goal of read mapping is to find the location(s) in the reference genome that have a "similar sequence". Traditionally, "similar sequence" was defined as having a high alignment score, and read mappers were viewed as heuristic solutions to this well-defined problem. For sketch-based mappers, however, there has been no formulation of the problem that an exact sketch-based mapping algorithm should solve. Moreover, there is no sketch-based method that can find all possible mapping positions for a read above a certain score threshold.

In this paper, we formulate the problem of read mapping at the level of sequence sketches. We give an exact dynamic programming algorithm that finds all hits above a given similarity threshold. It runs in O(|t| + |p| + 𝓁²) time and Θ(𝓁²) space, where |t| is the number of k-mers in the sketch of the reference (the text), |p| is the number of k-mers in the read’s sketch (the pattern), and 𝓁 is the number of times that k-mers from the pattern sketch occur in the sketch of the text. We evaluate our algorithm’s performance in mapping long reads to the T2T assembly of human chromosome Y, where ampliconic regions make it desirable to find all good mapping positions. At a level of precision equivalent to that of minimap2, our algorithm achieves a recall of 0.88, compared to only 0.76 for minimap2.

Sequence Sketching Long-read Mapping Exact Algorithm Dynamic Programming

Can Alkan, Jeffrey M Kidd, Tomas Marques-Bonet, Gozde Aksay, Francesca Antonacci, Fereydoun Hormozdiari, Jacob O Kitzman, Carl Baker, Maika Malig, Onur Mutlu, et al. Personalized copy number and segmental duplication maps using next-generation sequencing. Nature Genetics, 41(10):1061-1067, 2009.

Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403-410, 1990. doi:10.1016/S0022-2836(05)80360-2

Mahdi Belbasi, Antonio Blanca, Robert S Harris, David Koslicki, and Paul Medvedev. The minimizer Jaccard estimator is biased and inconsistent. Bioinformatics, 38(Supplement_1):i169-i176, June 2022. doi:10.1093/bioinformatics/btac244

Antonio Blanca, Robert S Harris, David Koslicki, and Paul Medvedev. The statistics of k-mers from a sequence undergoing a simple mutation process without spurious matches. Journal of Computational Biology, 29(2):155-168, 2022.

Monika Cechova, Rahulsimham Vegesna, Marta Tomaszkiewicz, Robert S Harris, Di Chen, Samarth Rangavittal, Paul Medvedev, and Kateryna D Makova. Dynamic evolution of great ape Y chromosomes. Proceedings of the National Academy of Sciences, 117(42):26273-26280, 2020.

Robert Edgar. Syncmers are more sensitive than minimizers for selecting conserved k-mers in biological sequences. PeerJ, 9:e10805, February 2021.

Mahmudur Rahman Hera, N. Tessa Pierce-Ward, and David Koslicki. Debiasing FracMinHash and deriving confidence intervals for mutation rates across a wide range of evolutionary distances. bioRxiv, January 2022.

Ting Hon, Kristin Mars, Greg Young, Yu-Chih Tsai, Joseph W Karalius, Jane M Landolin, Nicholas Maurer, David Kudrna, Michael A Hardigan, Cynthia C Steiner, et al. Highly accurate long-read HiFi sequencing data for five complex genomes. Scientific Data, 7(1):399, 2020.

Luiz Irber, Phillip T. Brooks, Taylor Reiter, N. Tessa Pierce-Ward, Mahmudur Rahman Hera, David Koslicki, and C. Titus Brown. Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers. bioRxiv, January 2022. doi:10.1101/2022.01.11.475838

Chirag Jain, Arang Rhie, Nancy F Hansen, Sergey Koren, and Adam M Phillippy. Long-read mapping to repetitive reference sequences using Winnowmap2. Nature Methods, 19:705-710, 2022.

Heng Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094-3100, 2018.

Paul Medvedev, Monica Stanciu, and Michael Brudno. Computational methods for discovering structural variation with next-generation sequencing. Nature Methods, 6:S13, 2009.

Sergey Nurk, Sergey Koren, Arang Rhie, Mikko Rautiainen, Andrey V Bzikadze, Alla Mikheenko, Mitchell R Vollger, Nicolas Altemose, Lev Uralsky, Ariel Gershman, et al. The complete sequence of a human genome. Science, 376(6588):44-53, 2022.

Yukiteru Ono, Kiyoshi Asai, and Michiaki Hamada. PBSIM2: a simulator for long-read sequencers with a novel generative model of quality scores. Bioinformatics, 37(5):589-595, 2021.

Michael Roberts, Wayne Hayes, Brian R Hunt, Stephen M Mount, and James A Yorke. Reducing storage requirements for biological sequence comparison. Bioinformatics, 20(18):3363-3369, 2004.

Kristoffer Sahlin, Thomas Baudeau, Bastien Cazaux, and Camille Marchet. A survey of mapping algorithms in the long-reads era. Genome Biology, 24(1):1-23, 2023.

Saul Schleimer, Daniel S Wilkerson, and Alex Aiken. Winnowing: Local algorithms for document fingerprinting. In Proceedings of the 22nd International Conference on Management of Data (SIGMOD 2003), pages 76-85, 2003.

Martin Šošić and Mile Šikić. Edlib: A C/C++ library for fast, exact sequence alignment using edit distance. Bioinformatics, 33(9):1394-1395, 2017.

<book-part-wrapper xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="2.0" xml:lang="en" content-type="research-article">

<collection-meta collection-type="book-series">

<collection-id collection-id-type="doi">10.1145/acmotherconferences</collection-id>

<title-group>

<title>ACM Other Conferences</title>

</title-group>

</collection-meta>

<book-meta>

<book-id book-id-type="acm-id">0000000</book-id>

<book-id book-id-type="doi">10.5555/0000000</book-id>

<book-title-group>

<book-title>Proceedings of the 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)</book-title>

<alt-title alt-title-type="acronym">WABI 2023</alt-title>

</book-title-group>

</book-meta>

<book-part book-part-type="chapter" xml:lang="en">

<book-part-meta>

<book-part-id book-part-id-type="doi">10.4230/LIPIcs.WABI.2023.14</book-part-id>

<book-part-id book-part-id-type="article-no">14</book-part-id>

<subj-group subj-group-type="ccs2012">

<compound-subject>

<compound-subject-part content-type="code">10010405.10010444.10010087</compound-subject-part>

<compound-subject-part content-type="text">Applied computing~Computational biology</compound-subject-part>

<compound-subject-part content-type="weight">500</compound-subject-part>

</compound-subject>

</subj-group>

<title-group>

<title>Exact Sketch-Based Read Mapping</title>

</title-group>

<contrib-group>

<contrib>

<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0003-0744-7078</contrib-id>

<name>

<surname>Schulz</surname>

<given-names>Tizian</given-names>

</name>

<aff>Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Germany</aff>

<email>tizian.schulz@uni-bielefeld.de</email>

<role>Author</role>

</contrib>

<contrib>

<name>

<surname>Medvedev</surname>

<given-names>Paul</given-names>

</name>

<aff>Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA</aff>

<email>pzm11@psu.edu</email>

<role>Author</role>

</contrib>

</contrib-group>

<pub-date date-type="publication">

<day>29</day>

<month>08</month>

<year>2023</year>

</pub-date>

<abstract>

<p>Given a sequencing read, the broad goal of read mapping is to find the location(s) in the reference genome that have a "similar sequence". Traditionally, "similar sequence" was defined as having a high alignment score, and read mappers were viewed as heuristic solutions to this well-defined problem. For sketch-based mappers, however, there has been no formulation of the problem that an exact sketch-based mapping algorithm should solve. Moreover, there is no sketch-based method that can find all possible mapping positions for a read above a certain score threshold.</p>

<p>In this paper, we formulate the problem of read mapping at the level of sequence sketches. We give an exact dynamic programming algorithm that finds all hits above a given similarity threshold. It runs in O(|t| + |p| + 𝓁²) time and Θ(𝓁²) space, where |t| is the number of k-mers in the sketch of the reference (the text), |p| is the number of k-mers in the read’s sketch (the pattern), and 𝓁 is the number of times that k-mers from the pattern sketch occur in the sketch of the text. We evaluate our algorithm’s performance in mapping long reads to the T2T assembly of human chromosome Y, where ampliconic regions make it desirable to find all good mapping positions. At a level of precision equivalent to that of minimap2, our algorithm achieves a recall of 0.88, compared to only 0.76 for minimap2.</p>

</abstract>

<kwd-group>

<kwd>Sequence Sketching</kwd>

<kwd>Long-read Mapping</kwd>

<kwd>Exact Algorithm</kwd>

<kwd>Dynamic Programming</kwd>

</kwd-group>

</book-part-meta>

<back>

<ref-list specific-use="unparsed">

<ref>

<mixed-citation>Can Alkan, Jeffrey M Kidd, Tomas Marques-Bonet, Gozde Aksay, Francesca Antonacci, Fereydoun Hormozdiari, Jacob O Kitzman, Carl Baker, Maika Malig, Onur Mutlu, et al. Personalized copy number and segmental duplication maps using next-generation sequencing. Nature Genetics, 41(10):1061-1067, 2009.</mixed-citation>

</ref>

<ref>

<mixed-citation>

Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403-410, 1990.

<pub-id pub-id-type="doi" xlink:href="10.1016/S0022-2836(05)80360-2">10.1016/S0022-2836(05)80360-2</pub-id>

</mixed-citation>

</ref>

<ref>

<mixed-citation>

Mahdi Belbasi, Antonio Blanca, Robert S Harris, David Koslicki, and Paul Medvedev. The minimizer Jaccard estimator is biased and inconsistent. Bioinformatics, 38(Supplement_1):i169-i176, June 2022.

<pub-id pub-id-type="doi" xlink:href="10.1093/bioinformatics/btac244">10.1093/bioinformatics/btac244</pub-id>

</mixed-citation>

</ref>

<ref>

<mixed-citation>Antonio Blanca, Robert S Harris, David Koslicki, and Paul Medvedev. The statistics of k-mers from a sequence undergoing a simple mutation process without spurious matches. Journal of Computational Biology, 29(2):155-168, 2022.</mixed-citation>

</ref>

<ref>

<mixed-citation>Monika Cechova, Rahulsimham Vegesna, Marta Tomaszkiewicz, Robert S Harris, Di Chen, Samarth Rangavittal, Paul Medvedev, and Kateryna D Makova. Dynamic evolution of great ape Y chromosomes. Proceedings of the National Academy of Sciences, 117(42):26273-26280, 2020.</mixed-citation>

</ref>

<ref>

<mixed-citation>Robert Edgar. Syncmers are more sensitive than minimizers for selecting conserved k-mers in biological sequences. PeerJ, 9:e10805, February 2021.</mixed-citation>

</ref>

<ref>

<mixed-citation>Mahmudur Rahman Hera, N. Tessa Pierce-Ward, and David Koslicki. Debiasing FracMinHash and deriving confidence intervals for mutation rates across a wide range of evolutionary distances. bioRxiv, January 2022.</mixed-citation>

</ref>

<ref>

<mixed-citation>Ting Hon, Kristin Mars, Greg Young, Yu-Chih Tsai, Joseph W Karalius, Jane M Landolin, Nicholas Maurer, David Kudrna, Michael A Hardigan, Cynthia C Steiner, et al. Highly accurate long-read HiFi sequencing data for five complex genomes. Scientific Data, 7(1):399, 2020.</mixed-citation>

</ref>

<ref>

<mixed-citation>

Luiz Irber, Phillip T. Brooks, Taylor Reiter, N. Tessa Pierce-Ward, Mahmudur Rahman Hera, David Koslicki, and C. Titus Brown. Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers. bioRxiv, January 2022.

<pub-id pub-id-type="doi" xlink:href="10.1101/2022.01.11.475838">10.1101/2022.01.11.475838</pub-id>

</mixed-citation>

</ref>

<ref>

<mixed-citation>Chirag Jain, Arang Rhie, Nancy F Hansen, Sergey Koren, and Adam M Phillippy. Long-read mapping to repetitive reference sequences using Winnowmap2. Nature Methods, 19:705-710, 2022.</mixed-citation>

</ref>

<ref>

<mixed-citation>Heng Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094-3100, 2018.</mixed-citation>

</ref>

<ref>

<mixed-citation>Paul Medvedev, Monica Stanciu, and Michael Brudno. Computational methods for discovering structural variation with next-generation sequencing. Nature Methods, 6:S13, 2009.</mixed-citation>

</ref>

<ref>

<mixed-citation>Sergey Nurk, Sergey Koren, Arang Rhie, Mikko Rautiainen, Andrey V Bzikadze, Alla Mikheenko, Mitchell R Vollger, Nicolas Altemose, Lev Uralsky, Ariel Gershman, et al. The complete sequence of a human genome. Science, 376(6588):44-53, 2022.</mixed-citation>

</ref>

<ref>

<mixed-citation>Yukiteru Ono, Kiyoshi Asai, and Michiaki Hamada. PBSIM2: a simulator for long-read sequencers with a novel generative model of quality scores. Bioinformatics, 37(5):589-595, 2021.</mixed-citation>

</ref>

<ref>

<mixed-citation>Michael Roberts, Wayne Hayes, Brian R Hunt, Stephen M Mount, and James A Yorke. Reducing storage requirements for biological sequence comparison. Bioinformatics, 20(18):3363-3369, 2004.</mixed-citation>

</ref>

<ref>

<mixed-citation>Kristoffer Sahlin, Thomas Baudeau, Bastien Cazaux, and Camille Marchet. A survey of mapping algorithms in the long-reads era. Genome Biology, 24(1):1-23, 2023.</mixed-citation>

</ref>

<ref>

<mixed-citation>Saul Schleimer, Daniel S Wilkerson, and Alex Aiken. Winnowing: Local algorithms for document fingerprinting. In Proceedings of the 22nd International Conference on Management of Data (SIGMOD 2003), pages 76-85, 2003.</mixed-citation>

</ref>

<ref>

<mixed-citation>Martin Šošić and Mile Šikić. Edlib: A C/C++ library for fast, exact sequence alignment using edit distance. Bioinformatics, 33(9):1394-1395, 2017.</mixed-citation>

</ref>

</ref-list>

</back>

</book-part>

</book-part-wrapper>