b-move: Faster Bidirectional Character Extensions in a Run-Length Compressed Index

Authors: Lore Depuydt, Luca Renders, Simon Van de Vyver, Lennart Veys, Travis Gagie, Jan Fostier




Author Details

Lore Depuydt
  • Ghent University - imec, Belgium
Luca Renders
  • Ghent University - imec, Belgium
Simon Van de Vyver
  • Ghent University, Belgium
Lennart Veys
  • Ghent University, Belgium
Travis Gagie
  • Dalhousie University, Halifax, Canada
Jan Fostier
  • Ghent University - imec, Belgium

Acknowledgements

The authors thank Ben Langmead, Nathaniel Brown, and Mohsen Zakeri for their helpful feedback and suggestions.

Cite As

Lore Depuydt, Luca Renders, Simon Van de Vyver, Lennart Veys, Travis Gagie, and Jan Fostier. b-move: Faster Bidirectional Character Extensions in a Run-Length Compressed Index. In 24th International Workshop on Algorithms in Bioinformatics (WABI 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 312, pp. 10:1-10:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.WABI.2024.10

Abstract

Due to the increasing availability of high-quality genome sequences, pan-genomes are gradually replacing single consensus reference genomes in many bioinformatics pipelines to better capture genetic diversity. Traditional bioinformatics tools using the FM-index face memory limitations with such large genome collections. Recent advances in run-length compressed indices, such as Gagie et al.’s r-index and Nishimoto and Tabei’s move structure, alleviate memory constraints but focus primarily on backward search for MEM-finding. Arakawa et al.’s br-index initiates complete approximate pattern matching using bidirectional search in run-length compressed space, but with significant computational overhead due to complex memory access patterns. We introduce b-move, a novel bidirectional extension of the move structure, enabling fast, cache-efficient bidirectional character extensions in run-length compressed space. It achieves bidirectional character extensions up to 8 times faster than the br-index, closing the performance gap with FM-index-based alternatives, while maintaining the br-index’s favorable memory characteristics. For example, all available complete E. coli genomes on NCBI’s RefSeq collection can be compiled into a b-move index that fits into the RAM of a typical laptop. Thus, b-move proves practical and scalable for pan-genome indexing and querying. We provide a C++ implementation of b-move, supporting efficient lossless approximate pattern matching including locate functionality, available at https://github.com/biointec/b-move under the AGPL-3.0 license.
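
To give a feel for the kind of table the move structure is built on, the following is a minimal Python sketch of a unidirectional LF step over a run-length encoded BWT. It is an illustration under stated assumptions, not the b-move implementation: the function names (`bwt_from_text`, `build_move_table`, `move_lf`) are hypothetical, the BWT is built naively from sorted rotations, and the text is assumed to end with a unique `$` sentinel. The bidirectional extensions described in the abstract are not covered here.

```python
from bisect import bisect_right


def bwt_from_text(text):
    """Naive BWT via sorted rotations; assumes text ends with a unique '$'."""
    n = len(text)
    order = sorted(range(n), key=lambda i: text[i:] + text[:i])
    return "".join(text[i - 1] for i in order)


def build_move_table(bwt):
    """One row per BWT run: (run start, run char, LF(run start), run holding LF(run start))."""
    # Collect the start position and character of each run.
    starts, chars = [], []
    for j, c in enumerate(bwt):
        if j == 0 or c != bwt[j - 1]:
            starts.append(j)
            chars.append(c)
    ends = starts[1:] + [len(bwt)]

    # smaller[c] = number of characters in the BWT that are smaller than c.
    counts = {}
    for c in bwt:
        counts[c] = counts.get(c, 0) + 1
    smaller, total = {}, 0
    for c in sorted(counts):
        smaller[c] = total
        total += counts[c]

    # LF(run start) = smaller[c] + occurrences of c before the run.
    seen = {c: 0 for c in counts}
    lf_starts = []
    for s, e, c in zip(starts, ends, chars):
        lf_starts.append(smaller[c] + seen[c])
        seen[c] += e - s

    # Precompute which run each LF(run start) falls into.
    run_of_lf = [bisect_right(starts, p) - 1 for p in lf_starts]
    return list(zip(starts, chars, lf_starts, run_of_lf))


def move_lf(table, pos, run):
    """Map (pos, run index) to (LF(pos), index of the run containing LF(pos))."""
    start, _, lf_start, target = table[run]
    new_pos = lf_start + (pos - start)  # offsets within a run are preserved by LF
    # Fast-forward over rows until the run containing new_pos is reached.
    while target + 1 < len(table) and table[target + 1][0] <= new_pos:
        target += 1
    return new_pos, target


if __name__ == "__main__":
    bwt = bwt_from_text("banana$")
    table = build_move_table(bwt)
    print(bwt, len(table), move_lf(table, 6, 4))
```

The point of the sketch is that an LF step touches one table row plus a short forward scan over neighbouring rows, instead of issuing rank queries over the whole BWT; that locality is the general idea behind the cache efficiency the abstract refers to.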

Subject Classification

ACM Subject Classification
  • Applied computing → Bioinformatics
Keywords
  • Pan-genomics
  • FM-index
  • r-index
  • Move Structure
  • Bidirectional Search
  • Approximate Pattern Matching
  • Lossless Alignment
  • Cache Efficiency


References

  1. Omar Ahmed, Massimiliano Rossi, Sam Kovaka, Michael C. Schatz, Travis Gagie, Christina Boucher, and Ben Langmead. Pan-genomic matching statistics for targeted nanopore sequencing. iScience, 24(6):102696, 2021. URL: https://doi.org/10.1016/j.isci.2021.102696.
  2. Omar Y. Ahmed, Massimiliano Rossi, Travis Gagie, Christina Boucher, and Ben Langmead. Spumoni 2: improved classification using a pangenome index of minimizer digests. Genome Biology, 24(1):122, May 2023. URL: https://doi.org/10.1186/s13059-023-02958-1.
  3. Yuma Arakawa, Gonzalo Navarro, and Kunihiko Sadakane. Bi-Directional r-Indexes. In 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022), volume 223 of Leibniz International Proceedings in Informatics (LIPIcs), pages 11:1-11:14, Dagstuhl, Germany, 2022. Schloss Dagstuhl - Leibniz-Zentrum für Informatik. URL: https://doi.org/10.4230/LIPIcs.CPM.2022.11.
  4. Andrej Baláž, Travis Gagie, Adrián Goga, Simon Heumos, Gonzalo Navarro, Alessia Petescia, and Jouni Sirén. Wheeler maps. In LATIN 2024: Theoretical Informatics, pages 178-192, Cham, 2024. Springer Nature Switzerland. URL: https://doi.org/10.1007/978-3-031-55598-5_12.
  5. Christina Boucher, Travis Gagie, Alan Kuhnle, and Giovanni Manzini. Prefix-Free Parsing for Building Big BWTs. In Laxmi Parida and Esko Ukkonen, editors, 18th International Workshop on Algorithms in Bioinformatics (WABI 2018), volume 113 of Leibniz International Proceedings in Informatics (LIPIcs), pages 2:1-2:16, Dagstuhl, Germany, 2018. Schloss Dagstuhl - Leibniz-Zentrum für Informatik. URL: https://doi.org/10.4230/LIPIcs.WABI.2018.2.
  6. Christina Boucher, Travis Gagie, I Tomohiro, Dominik Köppl, Ben Langmead, Giovanni Manzini, Gonzalo Navarro, Alejandro Pacheco, and Massimiliano Rossi. Phoni: Streamed matching statistics with multi-genome references. In 2021 Data Compression Conference (DCC), pages 193-202, 2021. URL: https://doi.org/10.1109/DCC50243.2021.00027.
  7. Nathaniel K. Brown, Travis Gagie, and Massimiliano Rossi. RLBWT tricks. In Data Compression Conference, DCC 2022, Snowbird, UT, USA, March 22-25, 2022, page 444. IEEE, 2022. URL: https://doi.org/10.1109/DCC52660.2022.00055.
  8. Michael Burrows and David Wheeler. A Block-Sorting Lossless Data Compression Algorithm. Research Report 124, Digital Equipment Corporation Systems Research Center, 130 Lytton Avenue, Palo Alto, California 94301, May 1994.
  9. Dustin Cobas, Travis Gagie, and Gonzalo Navarro. A Fast and Small Subsampled R-Index. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021), volume 191 of Leibniz International Proceedings in Informatics (LIPIcs), pages 13:1-13:16, Dagstuhl, Germany, 2021. Schloss Dagstuhl - Leibniz-Zentrum für Informatik. URL: https://doi.org/10.4230/LIPIcs.CPM.2021.13.
  10. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526(7571):68-74, October 2015. URL: https://doi.org/10.1038/nature15393.
  11. The Computational Pan-Genomics Consortium. Computational pan-genomics: status, promises and challenges. Briefings in Bioinformatics, 19(1):118-135, October 2016. URL: https://doi.org/10.1093/bib/bbw089.
  12. Lore Depuydt, Luca Renders, Thomas Abeel, and Jan Fostier. Pan-genome de Bruijn graph using the bidirectional FM-index. BMC Bioinform., 24(1):400, 2023. URL: https://doi.org/10.1186/S12859-023-05531-6.
  13. P. Ferragina and G. Manzini. Opportunistic data structures with applications. In Proceedings 41st Annual Symposium on Foundations of Computer Science, pages 390-398, 2000. URL: https://doi.org/10.1109/SFCS.2000.892127.
  14. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Optimal-time text indexing in BWT-runs bounded space. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pages 1459-1477. SIAM, 2018. URL: https://doi.org/10.1137/1.9781611975031.96.
  15. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space. J. ACM, 67(1), January 2020. URL: https://doi.org/10.1145/3375890.
  16. Adrián Goga, Lore Depuydt, Nathaniel K. Brown, Jan Fostier, Travis Gagie, and Gonzalo Navarro. Faster maximal exact matches with lazy LCP evaluation. In 2024 Data Compression Conference (DCC), pages 123-132, 2024. URL: https://doi.org/10.1109/DCC58796.2024.00020.
  17. Gregory Kucherov, Kamil Salikhov, and Dekel Tsur. Approximate String Matching Using a Bidirectional Index. In Combinatorial Pattern Matching, pages 222-231, Cham, 2014. Springer International Publishing. URL: https://doi.org/10.1007/978-3-319-07566-2_23.
  18. T. W. Lam, Ruiqiang Li, Alan Tam, Simon Wong, Edward Wu, and S. M. Yiu. High Throughput Short Read Alignment via Bi-directional BWT. In 2009 IEEE International Conference on Bioinformatics and Biomedicine, pages 31-36, 2009. URL: https://doi.org/10.1109/BIBM.2009.42.
  19. Ben Langmead and Steven L. Salzberg. Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4):357-359, April 2012. URL: https://doi.org/10.1038/nmeth.1923.
  20. Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14):1754-1760, 2009. URL: https://doi.org/10.1093/bioinformatics/btp324.
  21. Veli Mäkinen and Gonzalo Navarro. Succinct suffix arrays based on run-length encoding. In Combinatorial Pattern Matching, pages 45-56, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg. URL: https://doi.org/10.1007/11496656_5.
  22. Udi Manber and Gene Myers. Suffix Arrays: A New Method for On-Line String Searches. SIAM Journal on Computing, 22(5):935-948, 1993. URL: https://doi.org/10.1137/0222058.
  23. Takaaki Nishimoto and Yasuo Tabei. Optimal-time queries on BWT-runs compressed indexes. In 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, July 12-16, 2021, Glasgow, Scotland (Virtual Conference), volume 198 of LIPIcs, pages 101:1-101:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. URL: https://doi.org/10.4230/LIPICS.ICALP.2021.101.
  24. Luca Renders, Lore Depuydt, and Jan Fostier. Approximate Pattern Matching Using Search Schemes and In-Text Verification. In Bioinformatics and Biomedical Engineering, pages 419-435, Cham, 2022. Springer International Publishing. URL: https://doi.org/10.1007/978-3-031-07802-6_36.
  25. Luca Renders, Lore Depuydt, Sven Rahmann, and Jan Fostier. Automated design of efficient search schemes for lossless approximate pattern matching. In Research in Computational Molecular Biology, pages 164-184, Cham, 2024. Springer Nature Switzerland. URL: https://doi.org/10.1007/978-1-0716-3989-4_11.
  26. Luca Renders, Kathleen Marchal, and Jan Fostier. Dynamic partitioning of search patterns for approximate pattern matching using search schemes. iScience, 24(7):102687, 2021. URL: https://doi.org/10.1016/j.isci.2021.102687.
  27. Massimiliano Rossi, Marco Oliva, Ben Langmead, Travis Gagie, and Christina Boucher. MONI: A pangenomic index for finding maximal exact matches, 2022. URL: https://doi.org/10.1089/CMB.2021.0290.
  28. Julian Seward. bzip2 and libbzip2 - a program and library for data compression. Available at http://www.bzip.org, 1996.
  29. Mohsen Zakeri, Nathaniel K. Brown, Omar Y. Ahmed, Travis Gagie, and Ben Langmead. Movi: a fast and cache-efficient full-text pangenome index. bioRxiv, 2024. Accepted into RECOMB-seq 2024 proceedings track. URL: https://doi.org/10.1101/2023.11.04.565615.