Acceleration of FM-Index Queries Through Prefix-Free Parsing

Authors: Aaron Hong, Marco Oliva, Dominik Köppl, Hideo Bannai, Christina Boucher, Travis Gagie



Author Details

Aaron Hong
  • Department of Computer and Information Science and Engineering, Herbert Wertheim College of Engineering, University of Florida, Gainesville, FL, USA
Marco Oliva
  • Department of Computer and Information Science and Engineering, Herbert Wertheim College of Engineering, University of Florida, Gainesville, FL, USA
Dominik Köppl
  • Institut für Informatik, Universität Münster, Germany
Hideo Bannai
  • M&D Data Science Center, Tokyo Medical and Dental University, Japan
Christina Boucher
  • Department of Computer and Information Science and Engineering, Herbert Wertheim College of Engineering, University of Florida, Gainesville, FL, USA
Travis Gagie
  • Faculty of Computer Science, Dalhousie University, Halifax, Canada

Cite As

Aaron Hong, Marco Oliva, Dominik Köppl, Hideo Bannai, Christina Boucher, and Travis Gagie. Acceleration of FM-Index Queries Through Prefix-Free Parsing. In 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 13:1-13:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.WABI.2023.13

Abstract

FM-indexes are a crucial data structure in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer [Ferragina and Fischer, 2007] observed that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word boundaries, however, it must be parsed somehow before word-based FM-indexing can be applied. Deng et al. [Deng et al., 2022] proposed parsing genomic data by induced suffix sorting, and showed that the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing, which takes parameters that let us tune the average length of the phrases, instead of induced suffix sorting gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate that it is between 3 and 18 times faster than competing methods on queries to GRCh38, and consistently faster on queries to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, our method accelerates count queries relative to all state-of-the-art methods, with only a minor increase in memory.
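To illustrate the parsing step the abstract refers to, the following is a minimal sketch of prefix-free parsing in the spirit of Boucher et al. (reference 2), not the authors' implementation. It assumes a window length w and a modulus p as the two tuning parameters: a phrase ends wherever the hash of a length-w window is 0 modulo p, so a larger p tends to give longer phrases. The function names kr_hash, prefix_free_parse and reconstruct are hypothetical, the hash is recomputed per window rather than rolled, and the leading sentinel used in the published definition is omitted for brevity.

```python
def kr_hash(window, base=256, modulus=1_000_000_007):
    """Toy polynomial hash of a window (a stand-in for a rolling Karp-Rabin hash)."""
    h = 0
    for ch in window:
        h = (h * base + ord(ch)) % modulus
    return h


def prefix_free_parse(text, w, p):
    """Split text into overlapping phrases that end at trigger windows.

    Returns (dictionary, parse): the sorted distinct phrases, and the
    sequence of their ranks in that order. Assumes '$' does not occur in text.
    """
    if w < 1 or p < 1:
        raise ValueError("w and p must be positive")
    if not text or "$" in text:
        raise ValueError("text must be non-empty and must not contain '$'")

    padded = text + "$" * w          # the final window '$'*w always ends a phrase
    last = len(padded) - w           # start position of that final window
    phrases = []
    start = 0
    for i in range(1, last + 1):
        # A window is a trigger if its hash is 0 mod p, or if it is the final one.
        if i == last or kr_hash(padded[i:i + w]) % p == 0:
            phrases.append(padded[start:i + w])   # phrase includes the trigger
            start = i                             # next phrase starts with it
    dictionary = sorted(set(phrases))
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]
    return dictionary, parse


def reconstruct(dictionary, parse, w):
    """Invert the parse: consecutive phrases overlap by exactly w characters."""
    out = dictionary[parse[0]]
    for r in parse[1:]:
        out += dictionary[r][w:]
    return out


if __name__ == "__main__":
    text = "GATTACAGATTACAGATTAGATTACA"
    for w, p in [(2, 3), (4, 10)]:
        dictionary, parse = prefix_free_parse(text, w, p)
        print(f"w={w} p={p}: {len(parse)} phrases, {len(dictionary)} distinct")
```

A word-based index would then be built over the parse (the sequence of phrase ranks) rather than over individual characters; how the paper does this, and how patterns are mapped onto phrases, is not covered by this sketch.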

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • FM-index
  • pangenomics
  • scalability
  • word-based indexing
  • random access

References

  1. Tooru Akagi, Dominik Köppl, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Grammar index by induced suffix sorting. In Proceedings of the 28th International Symposium on String Processing and Information Retrieval (SPIRE), pages 85-99, 2021.
  2. Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun. Prefix-free parsing for building big BWTs. Algorithms for Molecular Biology, 14(1):13:1-13:15, 2019.
  3. Christina Boucher, Travis Gagie, Alan Kuhnle, and Giovanni Manzini. Prefix-free parsing for building big BWTs. In Proceedings of the Workshop on Algorithms in Bioinformatics (WABI), pages 2:1-2:16, 2018.
  4. Jin-Jie Deng, Wing-Kai Hon, Dominik Köppl, and Kunihiko Sadakane. FM-indexing grammars induced by suffix sorting for long patterns. In Proceedings of the IEEE Data Compression Conference (DCC), pages 63-72, 2022.
  5. Paolo Ferragina and Giovanni Manzini. Indexing Compressed Text. Journal of the ACM, 52:552-581, 2005.
  6. Paolo Ferragina and Johannes Fischer. Suffix arrays on words. In Proceedings of the 18th Annual Symposium on Combinatorial Pattern Matching (CPM), pages 328-339, 2007.
  7. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space. Journal of the ACM, 67(1):1-54, 2020.
  8. Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From Theory to Practice: Plug and Play with Succinct Data Structures. In Proceedings of the 13th Symposium on Experimental Algorithms (SEA), pages 326-337, 2014.
  9. Simon Gog, Juha Kärkkäinen, Dominik Kempa, Matthias Petri, and Simon J. Puglisi. Fixed block compression boosting in FM-indexes: Theory and practice. Algorithmica, 81:1370-1391, 2019.
  10. Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3):R25, 2009.
  11. Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. CoRR, 2013. URL: https://arxiv.org/abs/1303.3997.
  12. Veli Mäkinen and Gonzalo Navarro. Run-length FM-index. In Proceedings of the DIMACS Workshop: "The Burrows-Wheeler Transform: Ten Years Later", pages 17-19, 2004.
  13. Veli Mäkinen and Gonzalo Navarro. Succinct suffix arrays based on run-length encoding. In Proceedings of the Annual Symposium on Combinatorial Pattern Matching (CPM), pages 45-56, 2005.
  14. Udi Manber and Gene W. Myers. Suffix Arrays: A New Method for On-line String Searches. SIAM Journal on Computing, 22(5):935-948, 1993.
  15. Felix Mölder, Kim Philipp Jablonski, Brice Letcher, Michael B. Hall, Christopher H. Tomkins-Tinch, Vanessa Sochat, Jan Forster, Soohyun Lee, Sven O. Twardziok, Alexander Kanitz, et al. Sustainable data analysis with Snakemake. F1000Research, 10, 2021.
  16. Taher Mun, Alan Kuhnle, Christina Boucher, Travis Gagie, Ben Langmead, and Giovanni Manzini. Matching Reads to Many Genomes with the r-Index. Journal of Computational Biology, 27(4):514-518, 2020. URL: https://doi.org/10.1089/cmb.2019.0316.
  17. Ge Nong, Sen Zhang, and Wai Hong Chan. Two efficient algorithms for linear time suffix array construction. IEEE Transactions on Computers, 60(10):1471-1484, 2011. URL: https://doi.org/10.1109/TC.2010.188.
  18. Daniel Saad Nogueira Nunes, Felipe Alves da Louza, Simon Gog, Mauricio Ayala-Rincón, and Gonzalo Navarro. A grammar compression algorithm based on induced suffix sorting. In Proceedings of the IEEE Data Compression Conference (DCC), pages 42-51, 2018. URL: https://doi.org/10.1109/DCC.2018.00012.
  19. Jouni Sirén. Compressed suffix arrays for massive data. In Proceedings of the 16th International Symposium on String Processing and Information Retrieval (SPIRE), pages 63-74, 2009.