A Fast and Small Subsampled R-Index

Authors Dustin Cobas , Travis Gagie , Gonzalo Navarro



PDF
Thumbnail PDF

File

LIPIcs.CPM.2021.13.pdf
  • Filesize: 1.38 MB
  • 16 pages

Document Identifiers

Author Details

Dustin Cobas
  • CeBiB - Center for Biotechnology and Bioengineering, Santiago, Chile
  • Dept. of Computer Science, University of Chile, Santiago, Chile
Travis Gagie
  • CeBiB - Center for Biotechnology and Bioengineering, Santiago, Chile
  • Dalhousie University, Halifax, Canada
Gonzalo Navarro
  • CeBiB - Center for Biotechnology and Bioengineering, Santiago, Chile
  • Dept. of Computer Science, University of Chile, Santiago, Chile

Cite AsGet BibTex

Dustin Cobas, Travis Gagie, and Gonzalo Navarro. A Fast and Small Subsampled R-Index. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 13:1-13:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.CPM.2021.13

Abstract

The r-index (Gagie et al., JACM 2020) represented a breakthrough in compressed indexing of repetitive text collections, outperforming its alternatives by orders of magnitude. Its space usage, 𝒪(r) where r is the number of runs in the Burrows-Wheeler Transform of the text, is however larger than Lempel-Ziv and grammar-based indexes, and makes it uninteresting in various real-life scenarios of milder repetitiveness. In this paper we introduce the sr-index, a variant that limits a large fraction of the space to 𝒪(min(r,n/s)) for a text of length n and a given parameter s, at the expense of multiplying by s the time per occurrence reported. The sr-index is obtained by carefully subsampling the text positions indexed by the r-index, in a way that we prove is still able to support pattern matching with guaranteed performance. Our experiments demonstrate that the sr-index sharply outperforms virtually every other compressed index on repetitive texts, both in time and space, even matching the performance of the r-index while using 1.5-3.0 times less space. Only some Lempel-Ziv-based indexes achieve better compression than the sr-index, using about half the space, but they are an order of magnitude slower.

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • Pattern matching
  • r-index
  • compressed text indexing
  • repetitive text collections

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Christina Boucher, Ondrej Cvacho, Travis Gagie, Jan Holub, Giovanni Manzini, Gonzalo Navarro, and Massimiliano Rossi. PFP compressed suffix trees. In Proc. 23rd Workshop on Algorithm Engineering and Experiments (ALENEX), pages 60-72, 2021. Google Scholar
  2. Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun. Prefix-free parsing for building big BWTs. Algorithms for Molecular Biology, 14(1):13:1-13:15, 2019. Google Scholar
  3. Michael Burrows and David J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994. Google Scholar
  4. David R. Clark. Compact PAT Trees. PhD thesis, University of Waterloo, Canada, 1996. Google Scholar
  5. Francisco Claude, Gonzalo Navarro, and Alejandro Pacheco. Grammar-compressed indexes with logarithmic search time. Journal of Computer and System Sciences, 118:53-74, 2021. Google Scholar
  6. Diego Díaz-Domínguez and Gonzalo Navarro. A grammar compressor for collections of reads with applications to the construction of the BWT. In Proc. 31st Data Compression Conference (DCC), pages 93-102, 2021. Google Scholar
  7. Héctor Ferrada, Dominik Kempa, and Simon J. Puglisi. Hybrid indexing revisited. In Proc. 20th Workshop on Algorithm Engineering and Experiments (ALENEX), pages 1-8, 2018. Google Scholar
  8. Paolo Ferragina and Giovanni Manzini. Indexing Compressed Text. Journal of the ACM, 52(4):552-581, 2005. Google Scholar
  9. Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro. Compressed Representations of Sequences and Full-text Indexes. ACM Transactions on Algorithms, 3(2), 2007. Google Scholar
  10. Travis Gagie and Gonzalo Navarro. Compressed Indexes for Repetitive Textual Datasets, pages 475-480. Springer, 2019. Google Scholar
  11. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully-functional suffix trees and optimal text searching in BWT-runs bounded space. Journal of the ACM, 67(1):article 2, 2020. Google Scholar
  12. Rodrigo González, Gonzalo Navarro, and Héctor Ferrada. Locally compressed suffix arrays. ACM Journal of Experimental Algorithmics, 19(1):article 1, 2014. Google Scholar
  13. Dominik Kempa and Tomasz Kociumaka. Resolution of the Burrows-Wheeler transform conjecture. In Proc. 61st IEEE Symposium on Foundations of Computer Science (FOCS), pages 1002-1013, 2020. Google Scholar
  14. John C. Kieffer and En-Hui Yang. Grammar-based codes: A new class of universal lossless source codes. IEEE Transactions on Information Theory, 46(3):737-754, 2000. Google Scholar
  15. Sebastian Kreft and Gonzalo Navarro. On compressing and indexing repetitive sequences. Theoretical Computer Science, 483:115-133, 2013. Google Scholar
  16. Abraham Lempel and Jacob Ziv. On the complexity of finite sequences. IEEE Transactions on Information Theory, 22(1):75-81, 1976. Google Scholar
  17. Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing. Cambridge University Press, 2015. Google Scholar
  18. Veli Mäkinen and Gonzalo Navarro. Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing, 12(1):40-66, 2005. Google Scholar
  19. Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology, 17(3):281-308, 2010. Google Scholar
  20. Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935-948, 1993. Google Scholar
  21. Gonzalo Navarro. Indexing highly repetitive string collections, part II: Compressed indexes. ACM Computing Surveys, 54(2):article 26, 2021. Google Scholar
  22. Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1):article 2, 2007. Google Scholar
  23. Gonzalo Navarro and Víctor Sepúlveda. Practical indexing of repetitive collections using Relative Lempel-Ziv. In Proc. 29th Data Compression Conference (DCC), pages 201-210, 2019. Google Scholar
  24. Takaaki Nishimoto and Yasuo Tabei. Faster queries on BWT-runs compressed indexes. CoRR, 2006.05104, 2020. URL: http://arxiv.org/abs/2006.05104.
  25. Daisuke Okanohara and Kunihiko Sadakane. Practical entropy-compressed rank/select dictionary. In Proc. 9th Workshop on Algorithm Engineering and Experiments (ALENEX), pages 60-70, 2007. Google Scholar
  26. Simon J. Puglisi and Bella Zhukova. Relative Lempel-Ziv compression of suffix arrays. In Proc. 27th International Symposium on String Processing and Information Retrieval (SPIRE), pages 89-96, 2020. Google Scholar
  27. James Robinson, Dominic J. Barker, Xenia Georgiou, Michael A. Cooper, Paul Flicek, and Steven G. E. Marsh. IPD-IMGT/HLA Database. Nucleic Acids Research, 48(D1):D948-D955, 10 2019. Google Scholar
  28. Kunihiko Sadakane. New text indexing functionalities of the compressed suffix arrays. Journal of Algorithms, 48(2):294-313, 2003. Google Scholar
  29. Eric L. Stevens, Ruth Timme, Eric W. Brown, Marc W. Allard, Errol Strain, Kelly Bunning, and Steven Musser. The public health impact of a publically available, environmental database of microbial genomes. Frontiers in Microbiology, 8:808, 2017. Google Scholar
  30. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526:68-74, 2015. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail