Repetition- and Linearity-Aware Rank/Select Dictionaries

Ferragina, Paolo; Manzini, Giovanni; Vinciguerra, Giorgio

doi:10.4230/LIPIcs.ISAAC.2021.64

Abstract

We revisit the fundamental problem of compressing an integer dictionary that supports efficient rank and select operations by exploiting two kinds of regularities arising in real data: repetitiveness and approximate linearity. Our first contribution is a Lempel-Ziv parsing properly enriched to also capture approximate linearity in the data and still be compressed to the kth order entropy. Our second contribution is a variant of the block tree structure whose space complexity takes advantage of both repetitiveness and approximate linearity, and results highly competitive in time too. Our third and final contribution is an implementation and experimentation of this last data structure, which achieves new space-time trade-offs compared to known data structures that exploit only one of the two regularities.

Rachit Agarwal, Anurag Khandelwal, and Ion Stoica. Succinct: enabling queries on compressed data. In Proc. 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2015.
Diego Arroyuelo and Rajeev Raman. Adaptive succinctness. In Proc. 26th International Symposium on String Processing and Information Retrieval (SPIRE), 2019.
Jeremy Barbay and Gonzalo Navarro. Compressed representations of permutations, and applications. In Proc. 26th International Symposium on Theoretical Aspects of Computer Science (STACS), 2009.
Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. Theory and practice of monotone minimal perfect hashing. ACM Journal of Experimental Algorithmics, 16, 2008.
Djamal Belazzougui, Manuel Cáceres, Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Gonzalo Navarro, Alberto Ordóñez, Simon J. Puglisi, and Yasuo Tabei. Block trees. Journal of Computer and System Sciences, 117:1-22, 2021.
Djamal Belazzougui, Patrick Hagge Cording, Simon J. Puglisi, and Yasuo Tabei. Access, rank, and select in grammar-compressed strings. In Proc. 23rd Annual European Symposium on Algorithms (ESA), pages 142-154, 2015.
Antonio Boffa, Paolo Ferragina, and Giorgio Vinciguerra. A "learned" approach to quicken and compress rank/select dictionaries. In Proc. 23rd SIAM Symposium on Algorithm Engineering and Experiments (ALENEX), pages 46-59, 2021.
David Clark. Compact Pat Trees. PhD thesis, University of Waterloo, Canada, 1996.
Peter Elias. Efficient storage and retrieval by content and address of static files. Journal of the ACM, 21(2):246-260, 1974.
Robert Mario Fano. On the number of bits required to implement an associative memory. Memo 61. Massachusetts Institute of Technology, Project MAC, 1971.
Andrea Farruggia, Paolo Ferragina, Antonio Frangioni, and Rossano Venturini. Bicriteria data compression. SIAM Journal on Computing, 48(5):1603-1642, 2019.
Paolo Ferragina, Raffaele Giancarlo, and Giovanni Manzini. The myriad virtues of wavelet trees. Information and Computation, 207(8):849-866, 2009.
Paolo Ferragina, Stefan Kurtz, Stefano Lonardi, and Giovanni Manzini. Computational biology. In Dinesh P. Mehta and Sartaj Sahni, editors, Handbook of Data Structures and Applications, chapter 59. CRC Press, 2nd edition, 2018.
Paolo Ferragina, Fabrizio Lillo, and Giorgio Vinciguerra. On the performance of learned data structures. Theoretical Computer Science, 871:107-120, 2021.
Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552-581, 2005.
Paolo Ferragina and Rossano Venturini. A simple storage scheme for strings achieving entropy bounds. Theoretical Computer Science, 372(1):115-121, 2007.
Paolo Ferragina and Giorgio Vinciguerra. The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds. PVLDB, 13(8):1162-1175, 2020.
Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. Journal of the ACM, 67(1), 2020.
Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From theory to practice: plug and play with succinct data structures. In Proc. 13th International Symposium on Experimental Algorithms (SEA), pages 326-337, 2014.
Simon Gog, Juha Kärkkäinen, Dominik Kempa, Matthias Petri, and Simon J. Puglisi. Fixed block compression boosting in FM-indexes: Theory and practice. Algorithmica, 81(4):1370-1391, 2019.
Alexander Golynski, Alessio Orlandi, Rajeev Raman, and S. Srinivasa Rao. Optimal indexes for sparse bit vectors. Algorithmica, 69(4):906-924, 2014.
Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35(2):378-407, 2005.
John L. Hennessy and David A. Patterson. A new golden age for computer architecture. Communications of the ACM, 62(2):48-60, 2019.
Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Hybrid compression of bitvectors for the FM-index. In Proc. 24th Data Compression Conference (DCC), pages 302-311, 2014.
Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Towards a definitive measure of repetitiveness. In Proc. 14th Latin American Symposium on Theoretical Informatics (LATIN), 2020.
Sambasiva Rao Kosaraju and Giovanni Manzini. Compression of low entropy strings with Lempel-Ziv algorithms. SIAM Journal on Computing, 29(3):893-911, 1999.
Sebastian Kreft and Gonzalo Navarro. On compressing and indexing repetitive sequences. Theoretical Computer Science, 483:115-133, 2013.
Abraham Lempel and Jacob Ziv. On the complexity of finite sequences. IEEE Transactions on Information Theory, 22(1):75-81, 1976.
Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. Genome-Scale Algorithm Design. Cambridge University Press, 2015.
Veli Mäkinen and Gonzalo Navarro. Rank and select revisited and extended. Theoretical Computer Science, 387(3):332-347, 2007.
J. Ian Munro. Tables. In Proc. 16th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS), 1996.
J. Ian Munro and Venkatesh Raman. Succinct representation of balanced parentheses, static trees and planar graphs. In Proc. 38th Annual Symposium on Foundations of Computer Science (FOCS), 1997.
Gonzalo Navarro. Spaces, trees, and colors: the algorithmic landscape of document retrieval on sequences. ACM Computing Surveys, 46(4), 2014.
Gonzalo Navarro. Compact data structures: a practical approach. Cambridge University Press, 2016.
Gonzalo Navarro. Indexing highly repetitive string collections, part I: Repetitiveness measures. ACM Computing Surveys, 54(2), 2020.
Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1), 2007.
Daisuke Okanohara and Kunihiko Sadakane. Practical entropy-compressed rank/select dictionary. In Proc. 9th Workshop on Algorithm Engineering and Experiments (ALENEX), 2007.
Joseph O'Rourke. An on-line algorithm for fitting straight lines between data ranges. Communications of the ACM, 24(9):574-578, 1981.
Giuseppe Ottaviano, Nicola Tonellotto, and Rossano Venturini. Optimal space-time tradeoffs for inverted indexes. In Proc. 8th ACM International Conference on Web Search and Data Mining (WSDM), 2015.
Mihai Pǎtraşcu. Succincter. In Proc. 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2008.
Rajeev Raman, Venkatesh Raman, and Srinivasa Rao Satti. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Transactions on Algorithms, 3(4), 2007.
Sofya Raskhodnikova, Dana Ron, Ronitt Rubinfeld, and Adam Smith. Sublinear algorithms for approximating string compressibility. Algorithmica, 65(3):685-709, 2013.
Kunihiko Sadakane and Roberto Grossi. Squeezing succinct data structures into entropy bounds. In Proc. 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1230-1239, 2006.
Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337-343, 1977.

Repetition- and Linearity-Aware Rank/Select Dictionaries

Authors Paolo Ferragina , Giovanni Manzini , Giorgio Vinciguerra

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

References

Thanks for your feedback!

Could not send message

Repetition- and Linearity-Aware Rank/Select Dictionaries

Authors Paolo Ferragina , Giovanni Manzini , Giorgio Vinciguerra

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Funding

Supplementary Materials

References

Thanks for your feedback!

Could not send message