Solving the Minimal Positional Substring Cover Problem in Sublinear Space

Authors Paola Bonizzoni , Christina Boucher , Davide Cozzi , Travis Gagie , Yuri Pirola




Author Details

Paola Bonizzoni
  • Department of Computer Science, University of Milano-Bicocca, Italy
Christina Boucher
  • Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
Davide Cozzi
  • Department of Computer Science, University of Milano-Bicocca, Italy
Travis Gagie
  • Faculty of Computer Science, Dalhousie University, Halifax, Canada
Yuri Pirola
  • Department of Computer Science, University of Milano-Bicocca, Italy

Cite As

Paola Bonizzoni, Christina Boucher, Davide Cozzi, Travis Gagie, and Yuri Pirola. Solving the Minimal Positional Substring Cover Problem in Sublinear Space. In 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 296, pp. 12:1-12:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.CPM.2024.12

Abstract

Within the field of haplotype analysis, the Positional Burrows-Wheeler Transform (PBWT) stands out as a key innovation, addressing numerous challenges in genomics. For example, Sanaullah et al. introduced a PBWT-based method for the haplotype threading problem, which involves representing a query haplotype through a minimal set of substrings. To solve this problem using the PBWT data structure, they formulate the Minimal Positional Substring Cover (MPSC) problem and present a solution for it. They also present and solve several variants of this problem: k-MPSC, leftmost MPSC, rightmost MPSC, and length-maximal MPSC. However, each of their solutions requires a full PBWT, which leads to significant memory usage. Here, we take advantage of the latest results on run-length encoding the PBWT to solve the MPSC problem in a sublinear amount of space. Our approach is to show that k-Set Maximal Exact Matches (k-SMEMs) can be computed in a sublinear amount of space via efficient computation of k-Matching Statistics (k-MS). This leads to a solution that requires sublinear space not only for the MPSC problem but also for all of its variants proposed by Sanaullah et al. Most importantly, we present experimental results on haplotype panels from the 1000 Genomes Project data that show the utility of these theoretical results. We conclusively demonstrate that our approach markedly decreases the memory required to solve the MPSC problem, achieving a reduction of at least two orders of magnitude compared to the method proposed by Sanaullah et al. This efficiency allows us to solve large instances of the problem to which other methods are unable to scale. In summary, the creation of μ-PBWT paves the way for new possibilities in conducting in-depth genetic research and analysis on a large scale. All source code is publicly available at https://github.com/dlcgold/muPBWT/tree/k-smem.
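
To make the problem statement concrete, the following is a minimal sketch of the basic MPSC problem on a tiny panel. It is not the sublinear-space method of the paper and does not use the PBWT or μ-PBWT at all: it is a naive greedy scan, under the assumption that a positional substring is a stretch of the query that matches some panel haplotype at the same columns. The function names and the string encoding of haplotypes are hypothetical, chosen only for illustration.

```python
from typing import List, Optional, Tuple


def longest_positional_match(panel: List[str], query: str, start: int) -> Tuple[int, int]:
    """Return (length, row) of the longest stretch of query beginning at column
    `start` that matches some panel row at the same columns.
    Returns (0, -1) if no row matches the query at column `start`."""
    best_len, best_row = 0, -1
    for row_idx, row in enumerate(panel):
        length = 0
        while start + length < len(query) and row[start + length] == query[start + length]:
            length += 1
        if length > best_len:
            best_len, best_row = length, row_idx
    return best_len, best_row


def minimal_positional_substring_cover(
    panel: List[str], query: str
) -> Optional[List[Tuple[int, int, int]]]:
    """Greedy cover: from the current column, take the longest positional match
    and jump to its end. Each cover element is (start, end_exclusive, row).
    Returns None if some column of the query matches no haplotype."""
    cover = []
    pos = 0
    while pos < len(query):
        length, row = longest_positional_match(panel, query, pos)
        if length == 0:
            return None  # this column cannot be covered by any haplotype
        cover.append((pos, pos + length, row))
        pos += length
    return cover


# Illustrative toy data (hypothetical): three haplotypes over four biallelic sites.
panel = ["0101", "1100", "0011"]
query = "0100"
print(minimal_positional_substring_cover(panel, query))
```

The greedy choice yields a cover of minimum size because any sub-interval of a positional match is itself a positional match, so the longest match starting at the current column reaches as far as any match that covers that column. This naive scan takes time proportional to the panel size for each step and needs the whole panel in memory, which is the kind of cost the run-length encoded approach described above is meant to avoid.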

Subject Classification

ACM Subject Classification
  • Theory of computation → Data structures design and analysis
Keywords
  • Positional Burrows-Wheeler Transform
  • r-index
  • minimal positional substring cover
  • set-maximal exact matches


References

  1. Paola Bonizzoni, Christina Boucher, Davide Cozzi, Travis Gagie, Dominik Köppl, and Massimiliano Rossi. Data Structures for SMEM-Finding in the PBWT. In International Symposium on String Processing and Information Retrieval, pages 89-101. Springer, 2023.
  2. Davide Cozzi. muPBWT k-SMEM. Software, European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement PANGAIA No. 87253, swhId: https://archive.softwareheritage.org/swh:1:dir:d3467768a54423c8294abfc44f87f18705b3ed02;origin=https://github.com/dlcgold/muPBWT;visit=swh:1:snp:e69e661f21c323ba367bb685b5cd0da27a8b0462;anchor=swh:1:rev:9dc88c898146ae314ceb58aef9daab6ba27caa8c, (visited on 04/06/2024). URL: https://github.com/dlcgold/muPBWT/tree/k-smem.
  3. Davide Cozzi, Massimiliano Rossi, Simone Rubinacci, Travis Gagie, Dominik Köppl, Christina Boucher, and Paola Bonizzoni. μ-PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data. Bioinformatics, 39(9):btad552, 2023.
  4. Richard Durbin. Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT). Bioinformatics, 30(9):1266-1272, 2014.
  5. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Optimal-time text indexing in BWT-runs bounded space. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1459-1477. SIAM, 2018.
  6. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space. Journal of the ACM, 67(1):2:1-2:54, 2020. URL: https://doi.org/10.1145/3375890.
  7. Na Li and Matthew Stephens. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics, 165(4):2213-2233, 2003.
  8. Veli Mäkinen and Gonzalo Navarro. Succinct suffix arrays based on run-length encoding. In Combinatorial Pattern Matching: 16th Annual Symposium, CPM 2005, Jeju Island, Korea, June 19-22, 2005. Proceedings 16, pages 45-56. Springer, 2005.
  9. Massimiliano Rossi, Marco Oliva, Ben Langmead, Travis Gagie, and Christina Boucher. MONI: A Pangenomic Index for Finding Maximal Exact Matches. Journal of Computational Biology, 29(2):169-187, 2022.
  10. Ahsan Sanaullah, Degui Zhi, and Shaojie Zhang. Haplotype threading using the positional Burrows-Wheeler transform. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2022.
  11. Ahsan Sanaullah, Degui Zhi, and Shaojie Zhang. Minimal positional substring cover is a haplotype threading alternative to Li and Stephens Model. Genome Research, 33(7):1007-1014, 2023. URL: https://doi.org/10.1101/gr.277673.123.
  12. Igor Tatarnikov, Ardavan Shahrabi Farahani, Sana Kashgouli, and Travis Gagie. MONI Can Find k-MEMs. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2023.
  13. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526:68-74, 2015.
  14. Naga Sai Kavya Vaddadi, Taher Mun, and Ben Langmead. Minimizing reference bias with an impute-first approach. bioRxiv, 2023. URL: https://doi.org/10.1101/2023.11.30.568362.