Simple Runs-Bounded FM-Index Designs Are Fast

Díaz-Domínguez, Diego; Dönges, Saska; Puglisi, Simon J.; Salmela, Leena

doi:10.4230/LIPIcs.SEA.2023.7

File

LIPIcs.SEA.2023.7.pdf

Filesize: 1.03 MB
16 pages

Document Identifiers

DOI: 10.4230/LIPIcs.SEA.2023.7
URN: urn:nbn:de:0030-drops-183579

Author Details

Diego Díaz-Domínguez

Department of Computer Science, University of Helsinki, Finland

Saska Dönges

Department of Computer Science, University of Helsinki, Finland

Simon J. Puglisi

Helsinki Institute for Information Technology (HIIT), Finland
Department of Computer Science, University of Helsinki, Finland

Leena Salmela

Department of Computer Science, University of Helsinki, Finland

Acknowledgements

The authors wish to thank the Finnish Computing Competence Infrastructure (FCCI) for supporting this project with computational and data storage resources.

Cite AsGet BibTex

Diego Díaz-Domínguez, Saska Dönges, Simon J. Puglisi, and Leena Salmela. Simple Runs-Bounded FM-Index Designs Are Fast. In 21st International Symposium on Experimental Algorithms (SEA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 265, pp. 7:1-7:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.SEA.2023.7

Abstract

Given a string X of length n on alphabet σ, the FM-index data structure allows counting all occurrences of a pattern P of length m in O(m) time via an algorithm called backward search. An important difficulty when searching with an FM-index is to support queries on L, the Burrows-Wheeler transform of X, while L is in compressed form. This problem has been the subject of intense research for 25 years now. Run-length encoding of L is an effective way to reduce index size, in particular when the data being indexed is highly-repetitive, which is the case in many types of modern data, including those arising from versioned document collections and in pangenomics. This paper takes a back-to-basics look at supporting backward search in FM-indexes, exploring and engineering two simple designs. The first divides the BWT string into blocks containing b symbols each and then run-length compresses each block separately, possibly introducing new runs (compared to applying run-length encoding once, to the whole string). Each block stores counts of each symbol that occurs before the block. This method supports the operation rank_c(L, i) (i.e., count the number of times c occurs in the prefix L[1..i]) by first determining the block i/b in which i falls and scanning the block to the appropriate position counting occurrences of c along the way. This partial answer to rank_c(L, i) is then added to the stored count of c symbols before the block to determine the final answer. Our second design has a similar structure, but instead divides the run-length-encoded version of L into blocks containing an equal number of runs. The trick then is to determine the block in which a query falls, which is achieved via a predecessor query over the block starting positions. We show via extensive experiments on a wide range of repetitive text collections that these FM-indexes are not only easy to implement, but also fast and space efficient in practice.

Subject Classification

ACM Subject Classification

Theory of computation → Design and analysis of algorithms

Keywords

data structures
efficient algorithms

Metrics

Access Statistics
Total Accesses (updated on a weekly basis)

0

PDF Downloads

0

Metadata Views

References

Markus J. Bauer, Anthony J. Cox, and Giovanna Rosone. Lightweight algorithms for constructing and inverting the BWT of string collections. Theoretical Computer Science, 483:134-148, 2013.
Djamal Belazzougui, Fabio Cunial, Travis Gagie, Nicola Prezza, and Mathieu Raffinot. Flexible indexing of repetitive collections. In Proc. 13th Conference on Computability in Europe (CiE), pages 162-174, 2017.
Djamal Belazzougui and Gonzalo Navarro. Alphabet-independent compressed text indexing. ACM Transactions on Algorithms, 10(4):23:1-23:19, 2014.
Djamal Belazzougui and Gonzalo Navarro. Optimal lower and upper bounds for representing sequences. ACM Transactions on Algorithms, 11(4), 2015.
Nathaniel K. Brown, Travis Gagie, and Massimiliano Rossi. RLBWT tricks. In Proc. 20th International Symposium on Experimental Algorithms (SEA), page article 16, 2022.
Douglas Comer. Ubiquitous B-Tree. ACM Computing Surveys, 11(2):121-137, 1979.
Diego Díaz-Domínguez and Gonzalo Navarro. Efficient construction of the BWT for repetitive text using string compression. In Proc. 33rd Annual Symposium on Combinatorial Pattern Matching (CPM), volume 223, page article 29, 2022.
Dirk D. Dolle, Zhicheng Liu, Matthew Cotten, Jared T. Simpson, Zamin Iqbal, Richard Durbin, Shane A. McCarthy, and Thomas M. Keane. Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes. Genome Research, 27(2):300-309, 2017.
Jana Ebler, Peter Ebert, Wayne E. Clarke, Tobias Rausch, Peter A. Audano, Torsten Houwaart, Yafei Mao, Jan O. Korbel, Evan E. Eichler, Michael C. Zody, et al. Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes. Nature Genetics, 54(4):518-525, 2022.
Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proc. 41st Annual Symposium on Foundations of Computer Science (FOCS), pages 390-398, 2000.
Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully-functional suffix trees and optimal text searching in BWT-runs bounded space. Journal of the ACM, 67(1):article 2, 2020.
Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From theory to practice: Plug and play with succinct data structures. In Proc. 13th International Symposium on Experimental Algorithms, (SEA), pages 326-337, 2014. URL: https://doi.org/10.1007/978-3-319-07959-2_28.
Simon Gog, Juha Kärkkäinen, Dominik Kempa, Matthias Petri, and Simon J. Puglisi. Fixed block compression boosting in FM-indexes: Theory and practice. Algorithmica, 81(4):1370-1391, 2019.
Simon Gog and Matthias Petri. Optimized succinct data structures for massive data. Software: Practice and Experience, 44(11):1287-1314, 2014.
Simon Gog and Matthias Petri. Optimized succinct data structures for massive data. Softw. Pract. Exp., 44(11):1287-1314, 2014.
Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Hybrid compression of bitvectors for the FM-index. In Proc. 14th Data Compression Conference (DCC), pages 302-311, 2014.
Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Lempel-Ziv parsing in external memory. In Proc. 24th Data Compression Conference (DCC), pages 153-162, 2014.
Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with bowtie 2. Nature Methods, 9(4):357-359, 2012.
V. Mäkinen and G. Navarro. Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing, 12(1):40-66, 2005.
Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing. Cambridge University Press, 2015.
Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935-948, 1993.
Giovanni Manzini. An analysis of the burrows-wheeler transform. J. ACM, 48(3):407-430, 2001.
G. Navarro. Indexing highly repetitive string collections, part II: Compressed indexes. ACM Computing Surveys, 54(2):article 26, 2021.
G. Navarro and V. Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1):article 2, 2007.
Takaaki Nishimoto and Yasuo Tabei. Optimal-time queries on BWT-runs compressed indexes. In Proc. 48th International Colloquium on Automata, Languages, and Programming (ICALP), page article 101, 2021.
Rajeev Raman, Venkatesh Raman, and S. Srinivasa Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proc. 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 233-242, 2002.
Peter Sanders and Sebastian Winkel. Super scalar sample sort. In Proc. 12th European Symposium on Algorithms (ESA), pages 784-796, 2004.
Zachary D. Stephens, Skylar Y. Lee, Faraz Faghri, Roy H. Campbell, Chengxiang Zhai, Miles J. Efron, Ravishankar Iyer, Michael C. Schatz, Saurabh Sinha, and Gene E. Robinson. Big data: astronomical or genomical? PLoS Biology, 13(7):e1002195, 2015.
Sebastiano Vigna. Quasi-succinct indices. In Proc. Sixth ACM International Conference on Web Search and Data Mining (WSDM), pages 83-92. ACM, 2013.

Simple Runs-Bounded FM-Index Designs Are Fast

Authors Diego Díaz-Domínguez, Saska Dönges, Simon J. Puglisi , Leena Salmela

File

Document Identifiers

Author Details

Acknowledgements

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

References

Thanks for your feedback!

Could not send message

Simple Runs-Bounded FM-Index Designs Are Fast

Authors Diego Díaz-Domínguez, Saska Dönges, Simon J. Puglisi , Leena Salmela

File

Document Identifiers

Author Details

Funding

Acknowledgements

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

Supplementary Materials

References

Thanks for your feedback!

Could not send message