Move-r: Optimizing the r-index

Authors Nico Bertram, Johannes Fischer, Lukas Nalbach




File

LIPIcs.SEA.2024.1.pdf
  • Filesize: 1.06 MB
  • 19 pages

Author Details

Nico Bertram
  • Technische Universität Dortmund, Germany
Johannes Fischer
  • Technische Universität Dortmund, Germany
Lukas Nalbach
  • Technische Universität Dortmund, Germany

Cite As

Nico Bertram, Johannes Fischer, and Lukas Nalbach. Move-r: Optimizing the r-index. In 22nd International Symposium on Experimental Algorithms (SEA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 301, pp. 1:1-1:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.SEA.2024.1

Abstract

We present a static text index called Move-r, a highly optimized version of the r-index (Gagie et al., 2020) that incorporates recent theoretical developments of the move data structure (Nishimoto and Tabei, 2021). The r-index is the method of choice for indexing highly repetitive texts, such as different versions of a text document or DNA from the same species, as it exploits the compressibility of the underlying data. With Move-r, we can answer count and locate queries 2-35 (typically 15) times as fast as with any other r-index supporting locate queries while being 0.8-2.5 (typically 2) times as large. A Move-r index can be constructed 0.9-2 (typically 2) times as fast while using 1/3-1 (typically 1/2) times as much space.
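To make the terms in the abstract concrete, the following is a minimal sketch of a count query answered by backward search over the Burrows-Wheeler Transform (BWT), together with the number of BWT runs r that the r-index exploits. It is not the authors' implementation: Move-r works on a run-length compressed representation with the move data structure, whereas this sketch uses plain Python strings, a naive BWT construction, and linear-time rank. The function names `bwt`, `count_runs`, and `count_occurrences` are hypothetical, and the sketch assumes the text contains no character smaller than the sentinel `$`.

```python
def bwt(text: str) -> str:
    """Naive BWT via sorted rotations (illustration only, far too slow in practice).

    Assumes '$' does not occur in text and is smaller than every text character.
    """
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(b: str) -> int:
    """Number of maximal runs of equal characters in b (the parameter r)."""
    return sum(1 for i, c in enumerate(b) if i == 0 or c != b[i - 1])


def count_occurrences(b: str, pattern: str) -> int:
    """Count occurrences of pattern in the original text by backward search on its BWT b."""
    # first[c] = number of characters in b that are strictly smaller than c
    first = {}
    total = 0
    for c in sorted(set(b)):
        first[c] = total
        total += b.count(c)

    lo, hi = 0, len(b)  # half-open interval of matching suffixes
    for c in reversed(pattern):
        if c not in first:
            return 0
        # LF-mapping step; rank is computed naively with str.count
        lo = first[c] + b[:lo].count(c)
        hi = first[c] + b[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    b = bwt("abracadabra")
    print(b)                             # ard$rcaaaabb
    print(count_runs(b))                 # 8
    print(count_occurrences(b, "abra"))  # 2
```

On a highly repetitive input, the number of runs is typically much smaller than the text length, which is the property that indexes of this family rely on for their space bounds.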

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • Compressed Text Index
  • Burrows-Wheeler Transform


References

  1. Michael Axtmann, Sascha Witt, Daniel Ferizovic, and Peter Sanders. Engineering in-place (shared-memory) sorting algorithms. ACM Transactions on Parallel Computing, 9(1):2:1-2:62, 2022.
  2. Hideo Bannai, Travis Gagie, and Tomohiro I. Refining the r-index. Theoretical Computer Science, 812:96-108, 2020.
  3. Djamal Belazzougui and Gonzalo Navarro. Optimal lower and upper bounds for representing sequences. ACM Transactions on Algorithms, 11(4):31:1-31:21, 2015.
  4. Christina Boucher, Travis Gagie, Alan Kuhnle, and Giovanni Manzini. Prefix-free parsing for building big BWTs. In 18th International Workshop on Algorithms in Bioinformatics WABI, pages 2:1-2:16, 2018.
  5. Nathaniel K. Brown, Travis Gagie, and Massimiliano Rossi. RLBWT tricks. In 20th International Symposium on Experimental Algorithms SEA, pages 16:1-16:16, 2022.
  6. Michael Burrows and David J. Wheeler. A block-sorting lossless data compression algorithm. In Technical Report DEC, volume 124, 1994.
  7. David Richard Clark. Compact Pat Trees. PhD thesis, University of Waterloo, CAN, 1998.
  8. Diego Díaz-Domínguez, Saska Dönges, Simon J. Puglisi, and Leena Salmela. Simple runs-bounded FM-index designs are fast. In 21st International Symposium on Experimental Algorithms SEA, pages 7:1-7:16, 2023.
  9. Diego Díaz-Domínguez and Gonzalo Navarro. Efficient construction of the BWT for repetitive text using string compression. In 33rd Annual Symposium on Combinatorial Pattern Matching CPM, pages 29:1-29:18, 2022.
  10. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552-581, 2005.
  11. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. Journal of the ACM, 67(1):2:1-2:54, 2020.
  12. Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From theory to practice: Plug and play with succinct data structures. In 13th International Symposium on Experimental Algorithms SEA, pages 326-337, 2014.
  13. Torben Hagerup. Sorting and searching on the word RAM. In 15th Annual Symposium on Theoretical Aspects of Computer Science STACS, Lecture Notes in Computer Science, pages 366-398, 1998.
  14. Udi Manber and Gene Myers. Suffix arrays: A new method for on-line string searches. In 1st Annual ACM-SIAM Symposium on Discrete Algorithms, pages 319-327, 1990.
  15. Taher Mun, Alan Kuhnle, Christina Boucher, Travis Gagie, Ben Langmead, and Giovanni Manzini. Matching reads to many genomes with the r-index. Journal of Computational Biology, 27(4):514-518, 2020.
  16. Gonzalo Navarro. Compact Data Structures - A Practical Approach. Cambridge University Press, 2016.
  17. Gonzalo Navarro. Indexing highly repetitive string collections, part I: repetitiveness measures. ACM Computing Surveys, 54(2):29:1-29:31, 2022.
  18. Takaaki Nishimoto, Shunsuke Kanda, and Yasuo Tabei. An optimal-time RLBWT construction in BWT-runs bounded space. In 49th International Colloquium on Automata, Languages, and Programming ICALP, pages 99:1-99:20, 2022.
  19. Takaaki Nishimoto and Yasuo Tabei. Optimal-time queries on BWT-runs compressed indexes. In 48th International Colloquium on Automata, Languages, and Programming ICALP, pages 101:1-101:15, 2021.
  20. Daisuke Okanohara and Kunihiko Sadakane. Practical entropy-compressed rank/select dictionary. In 9th Workshop on Algorithm Engineering and Experiments ALENEX, pages 60-70, 2007.
  21. Anna Pagh, Rasmus Pagh, and Milan Ruzic. Linear probing with 5-wise independence. SIAM Review, 53(3):547-558, 2011.
  22. Nicola Prezza. A framework of dynamic data structures for string processing. In 16th International Symposium on Experimental Algorithms SEA, pages 11:1-11:15, 2017.
  23. Sebastiano Vigna. Broadword implementation of rank/select queries. In 7th International Workshop on Experimental Algorithms WEA, pages 154-168, 2008.