A Fast and Small Subsampled R-Index

Cobas, Dustin; Gagie, Travis; Navarro, Gonzalo

doi:10.4230/LIPIcs.CPM.2021.13

File

Subject Classification

ACM Subject Classification

Theory of computation → Pattern matching

Keywords

Pattern matching
r-index
compressed text indexing
repetitive text collections

Metrics

Access Statistics
Total Accesses (updated on a weekly basis)

0

Document

0

Metadata

Abstract

The r-index (Gagie et al., JACM 2020) represented a breakthrough in compressed indexing of repetitive text collections, outperforming its alternatives by orders of magnitude. Its space usage, 𝒪(r) where r is the number of runs in the Burrows-Wheeler Transform of the text, is however larger than Lempel-Ziv and grammar-based indexes, and makes it uninteresting in various real-life scenarios of milder repetitiveness. In this paper we introduce the sr-index, a variant that limits a large fraction of the space to 𝒪(min(r,n/s)) for a text of length n and a given parameter s, at the expense of multiplying by s the time per occurrence reported. The sr-index is obtained by carefully subsampling the text positions indexed by the r-index, in a way that we prove is still able to support pattern matching with guaranteed performance. Our experiments demonstrate that the sr-index sharply outperforms virtually every other compressed index on repetitive texts, both in time and space, even matching the performance of the r-index while using 1.5-3.0 times less space. Only some Lempel-Ziv-based indexes achieve better compression than the sr-index, using about half the space, but they are an order of magnitude slower.

Cite As Get BibTex

Dustin Cobas, Travis Gagie, and Gonzalo Navarro. A Fast and Small Subsampled R-Index. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 13:1-13:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021) https://doi.org/10.4230/LIPIcs.CPM.2021.13

Author Details

Dustin Cobas

CeBiB - Center for Biotechnology and Bioengineering, Santiago, Chile
Dept. of Computer Science, University of Chile, Santiago, Chile

Travis Gagie

CeBiB - Center for Biotechnology and Bioengineering, Santiago, Chile
Dalhousie University, Halifax, Canada

Gonzalo Navarro

CeBiB - Center for Biotechnology and Bioengineering, Santiago, Chile
Dept. of Computer Science, University of Chile, Santiago, Chile

References

Christina Boucher, Ondrej Cvacho, Travis Gagie, Jan Holub, Giovanni Manzini, Gonzalo Navarro, and Massimiliano Rossi. PFP compressed suffix trees. In Proc. 23rd Workshop on Algorithm Engineering and Experiments (ALENEX), pages 60-72, 2021.
Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun. Prefix-free parsing for building big BWTs. Algorithms for Molecular Biology, 14(1):13:1-13:15, 2019.
Michael Burrows and David J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
David R. Clark. Compact PAT Trees. PhD thesis, University of Waterloo, Canada, 1996.
Francisco Claude, Gonzalo Navarro, and Alejandro Pacheco. Grammar-compressed indexes with logarithmic search time. Journal of Computer and System Sciences, 118:53-74, 2021.
Diego Díaz-Domínguez and Gonzalo Navarro. A grammar compressor for collections of reads with applications to the construction of the BWT. In Proc. 31st Data Compression Conference (DCC), pages 93-102, 2021.
Héctor Ferrada, Dominik Kempa, and Simon J. Puglisi. Hybrid indexing revisited. In Proc. 20th Workshop on Algorithm Engineering and Experiments (ALENEX), pages 1-8, 2018.
Paolo Ferragina and Giovanni Manzini. Indexing Compressed Text. Journal of the ACM, 52(4):552-581, 2005.
Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro. Compressed Representations of Sequences and Full-text Indexes. ACM Transactions on Algorithms, 3(2), 2007.
Travis Gagie and Gonzalo Navarro. Compressed Indexes for Repetitive Textual Datasets, pages 475-480. Springer, 2019.
Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully-functional suffix trees and optimal text searching in BWT-runs bounded space. Journal of the ACM, 67(1):article 2, 2020.
Rodrigo González, Gonzalo Navarro, and Héctor Ferrada. Locally compressed suffix arrays. ACM Journal of Experimental Algorithmics, 19(1):article 1, 2014.
Dominik Kempa and Tomasz Kociumaka. Resolution of the Burrows-Wheeler transform conjecture. In Proc. 61st IEEE Symposium on Foundations of Computer Science (FOCS), pages 1002-1013, 2020.
John C. Kieffer and En-Hui Yang. Grammar-based codes: A new class of universal lossless source codes. IEEE Transactions on Information Theory, 46(3):737-754, 2000.
Sebastian Kreft and Gonzalo Navarro. On compressing and indexing repetitive sequences. Theoretical Computer Science, 483:115-133, 2013.
Abraham Lempel and Jacob Ziv. On the complexity of finite sequences. IEEE Transactions on Information Theory, 22(1):75-81, 1976.
Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing. Cambridge University Press, 2015.
Veli Mäkinen and Gonzalo Navarro. Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing, 12(1):40-66, 2005.
Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology, 17(3):281-308, 2010.
Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935-948, 1993.
Gonzalo Navarro. Indexing highly repetitive string collections, part II: Compressed indexes. ACM Computing Surveys, 54(2):article 26, 2021.
Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1):article 2, 2007.
Gonzalo Navarro and Víctor Sepúlveda. Practical indexing of repetitive collections using Relative Lempel-Ziv. In Proc. 29th Data Compression Conference (DCC), pages 201-210, 2019.
Takaaki Nishimoto and Yasuo Tabei. Faster queries on BWT-runs compressed indexes. CoRR, 2006.05104, 2020. URL: http://arxiv.org/abs/2006.05104.
Daisuke Okanohara and Kunihiko Sadakane. Practical entropy-compressed rank/select dictionary. In Proc. 9th Workshop on Algorithm Engineering and Experiments (ALENEX), pages 60-70, 2007.
Simon J. Puglisi and Bella Zhukova. Relative Lempel-Ziv compression of suffix arrays. In Proc. 27th International Symposium on String Processing and Information Retrieval (SPIRE), pages 89-96, 2020.
James Robinson, Dominic J. Barker, Xenia Georgiou, Michael A. Cooper, Paul Flicek, and Steven G. E. Marsh. IPD-IMGT/HLA Database. Nucleic Acids Research, 48(D1):D948-D955, 10 2019.
Kunihiko Sadakane. New text indexing functionalities of the compressed suffix arrays. Journal of Algorithms, 48(2):294-313, 2003.
Eric L. Stevens, Ruth Timme, Eric W. Brown, Marc W. Allard, Errol Strain, Kelly Bunning, and Steven Musser. The public health impact of a publically available, environmental database of microbial genomes. Frontiers in Microbiology, 8:808, 2017.
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526:68-74, 2015.

A Fast and Small Subsampled R-Index

Authors Dustin Cobas , Travis Gagie , Gonzalo Navarro

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

References

Thanks for your feedback!

Could not send message

A Fast and Small Subsampled R-Index

Authors Dustin Cobas , Travis Gagie , Gonzalo Navarro

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Funding

Supplementary Materials

References

Thanks for your feedback!

Could not send message