Acceleration of FM-Index Queries Through Prefix-Free Parsing

Hong, Aaron; Oliva, Marco; Köppl, Dominik; Bannai, Hideo; Boucher, Christina; Gagie, Travis

doi:10.4230/LIPIcs.WABI.2023.13

Abstract

FM-indexes are a crucial data structure in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer [Ferragina and Fischer, 2007] observed that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word boundaries, however, it must be parsed somehow before word-based FM-indexing can be applied. In 2022, Deng et al. [Deng et al., 2022] proposed parsing genomic data by induced suffix sorting, and showed that the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing, which takes parameters that let us tune the average length of the phrases, instead of induced suffix sorting gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate that it is between 3 and 18 times faster than competing methods on queries to GRCh38, and consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, our method accelerates count queries over all state-of-the-art methods with a minor increase in memory.
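To make the idea of tunable phrase lengths concrete, the following is a minimal Python sketch of prefix-free parsing. It is not the authors' PFP-FM implementation. It assumes a window length `w` and a modulus `p` as the two parameters, and cuts the text wherever the hash of a length-`w` window is divisible by `p`, so that consecutive phrases overlap by `w` characters. The function names, the sentinel characters, and the simple polynomial hash (recomputed for each window rather than rolled) are all illustrative assumptions.

```python
from typing import List, Tuple

SENTINEL_START = "\x01"  # assumed not to occur in the text
SENTINEL_END = "\x02"    # assumed not to occur in the text


def window_hash(window: str, base: int = 256, prime: int = 1_000_000_007) -> int:
    """Polynomial hash of one window (recomputed per window for clarity)."""
    h = 0
    for ch in window:
        h = (h * base + ord(ch)) % prime
    return h


def prefix_free_parse(text: str, w: int, p: int) -> Tuple[List[str], List[int]]:
    """Return (sorted dictionary of distinct phrases, parse as phrase ranks)."""
    padded = SENTINEL_START + text + SENTINEL_END * w
    last = len(padded) - w  # start of the final window (w end sentinels)
    phrases = []
    start = 0
    for i in range(1, last + 1):
        # A window is a cut point if its hash is 0 mod p, or if it is the
        # final window, which always closes the last phrase.
        if i == last or window_hash(padded[i:i + w]) % p == 0:
            phrases.append(padded[start:i + w])
            start = i  # the next phrase begins with this same window
    dictionary = sorted(set(phrases))
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    return dictionary, [rank[phrase] for phrase in phrases]


def unparse(dictionary: List[str], parse: List[int], w: int) -> str:
    """Rebuild the padded text; every phrase after the first drops its w-overlap."""
    pieces = [dictionary[parse[0]]]
    pieces += [dictionary[r][w:] for r in parse[1:]]
    return "".join(pieces)


if __name__ == "__main__":
    sample = "GATTACAGATTACAGGATCCA" * 20
    for p in (2, 8, 32):
        dictionary, parse = prefix_free_parse(sample, w=4, p=p)
        print(p, len(dictionary), len(sample) / len(parse))
```

In this sketch a larger `p` makes cut points rarer, so phrases tend to be longer on average, while `p = 1` cuts at every window. A word-based FM-index would then be built over the sequence of phrase ranks rather than over individual characters.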

Cite As

Aaron Hong, Marco Oliva, Dominik Köppl, Hideo Bannai, Christina Boucher, and Travis Gagie. Acceleration of FM-Index Queries Through Prefix-Free Parsing. In 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 13:1-13:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023) https://doi.org/10.4230/LIPIcs.WABI.2023.13

Author Details

Aaron Hong

Department of Computer and Information Science and Engineering, Herbert Wertheim College of Engineering, University of Florida, Gainesville, FL, USA

Marco Oliva

Department of Computer and Information Science and Engineering, Herbert Wertheim College of Engineering, University of Florida, Gainesville, FL, USA

Dominik Köppl

Institut für Informatik, Universität Münster, Germany

Hideo Bannai

M&D Data Science Center, Tokyo Medical and Dental University, Japan

Christina Boucher

Department of Computer and Information Science and Engineering, Herbert Wertheim College of Engineering, University of Florida, Gainesville, FL, USA

Travis Gagie

Faculty of Computer Science, Dalhousie University, Halifax, Canada

Funding

Hong, Aaron: NIH/NHGRI grant R01HG011392 to Ben Langmead, NSF/BIO grant DBI-2029552 to Christina Boucher
Oliva, Marco: NIH/NHGRI grant R01HG011392 to Ben Langmead, NSF/BIO grant DBI-2029552 to Christina Boucher
Köppl, Dominik: JSPS KAKENHI Grant Number JP21K17701, JP22H03551, and JP23H04378
Bannai, Hideo: JSPS KAKENHI Grant Number JP20H04141
Boucher, Christina: NIH/NHGRI grant R01HG011392 to Ben Langmead, NSF/BIO grant DBI-2029552 to Christina Boucher, NSF/SCH grant INT-2013998 to Christina Boucher, and NIH/NIAID grant R01AI14180 to Christina Boucher
Gagie, Travis: NIH/NHGRI grant R01HG011392 to Ben Langmead, NSERC grant RGPIN-07185-2020 to Travis Gagie, NSF/BIO grant DBI-2029552 to Christina Boucher

Supplementary Materials

Software (Source Code of PFP-FM) https://github.com/marco-oliva/afm

References

Tooru Akagi, Dominik Köppl, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Grammar index by induced suffix sorting. In Proceedings of the 28th International Symposium on String Processing and Information Retrieval (SPIRE), pages 85-99, 2021.
Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun. Prefix-free parsing for building big BWTs. Algorithms for Molecular Biology, 14(1):13:1-13:15, 2019.
Christina Boucher, Travis Gagie, Alan Kuhnle, and Giovanni Manzini. Prefix-free parsing for building big BWTs. In Proceedings of the Workshop on Algorithms in Bioinformatics (WABI), pages 2:1-2:16, 2018.
Jin-Jie Deng, Wing-Kai Hon, Dominik Köppl, and Kunihiko Sadakane. FM-indexing grammars induced by suffix sorting for long patterns. In Proceedings of the IEEE Data Compression Conference (DCC), pages 63-72, 2022.
Paolo Ferragina and Giovanni Manzini. Indexing Compressed Text. Journal of the ACM, 52:552-581, 2005.
Paolo Ferragina and Johannes Fischer. Suffix arrays on words. In Proceedings of the 18th Annual Symposium Combinatorial Pattern Matching (CPM), pages 328-339, 2007.
Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space. Journal of the ACM, 67(1):1-54, 2020.
Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From Theory to Practice: Plug and Play with Succinct Data Structures. In Proceedings of the 13th Symposium on Experimental Algorithms (SEA), pages 326-337, 2014.
Simon Gog, Juha Kärkkäinen, Dominik Kempa, Matthias Petri, and Simon J Puglisi. Fixed block compression boosting in FM-indexes: Theory and practice. Algorithmica, 81:1370-1391, 2019.
Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3):R25-R25, 2009.
Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. CoRR, 2013. URL: https://arxiv.org/abs/1303.3997.
Veli Mäkinen and Gonzalo Navarro. Run-length FM-index. In Proceedings of the DIMACS Workshop: "The Burrows-Wheeler Transform: Ten Years Later", pages 17-19, 2004.
Veli Mäkinen and Gonzalo Navarro. Succinct suffix arrays based on run-length encoding. In Proceedings of the Annual Symposium on Combinatorial Pattern Matching, pages 45-56, 2005.
Udi Manber and Gene W. Myers. Suffix Arrays: A New Method for On-line String Searches. SIAM Journal on Computing, 22(5):935-948, 1993.
Felix Mölder, Kim Philipp Jablonski, Brice Letcher, Michael B Hall, Christopher H Tomkins-Tinch, Vanessa Sochat, Jan Forster, Soohyun Lee, Sven O Twardziok, Alexander Kanitz, et al. Sustainable data analysis with Snakemake. F1000Research, 10, 2021.
Taher Mun, Alan Kuhnle, Christina Boucher, Travis Gagie, Ben Langmead, and Giovanni Manzini. Matching Reads to Many Genomes with the r-Index. Journal of Computational Biology, 27(4):514-518, 2020. URL: https://doi.org/10.1089/cmb.2019.0316.
Ge Nong, Sen Zhang, and Wai Hong Chan. Two efficient algorithms for linear time suffix array construction. IEEE Trans. Computers, 60(10):1471-1484, 2011. URL: https://doi.org/10.1109/TC.2010.188.
Daniel Saad Nogueira Nunes, Felipe Alves da Louza, Simon Gog, Mauricio Ayala-Rincón, and Gonzalo Navarro. A grammar compression algorithm based on induced suffix sorting. In Proc. DCC, pages 42-51, 2018. URL: https://doi.org/10.1109/DCC.2018.00012.
Jouni Sirén. Compressed suffix arrays for massive data. In Proceedings of the 16th International Symposium on String Processing and Information Retrieval (SPIRE), pages 63-74, 2009.
