Repetition- and Linearity-Aware Rank/Select Dictionaries

Authors Paolo Ferragina , Giovanni Manzini , Giorgio Vinciguerra



PDF
Thumbnail PDF

File

LIPIcs.ISAAC.2021.64.pdf
  • Filesize: 0.93 MB
  • 16 pages

Document Identifiers

Author Details

Paolo Ferragina
  • Department of Computer Science, University of Pisa, Italy
Giovanni Manzini
  • Department of Computer Science, University of Pisa, Italy
Giorgio Vinciguerra
  • Department of Computer Science, University of Pisa, Italy

Cite AsGet BibTex

Paolo Ferragina, Giovanni Manzini, and Giorgio Vinciguerra. Repetition- and Linearity-Aware Rank/Select Dictionaries. In 32nd International Symposium on Algorithms and Computation (ISAAC 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 212, pp. 64:1-64:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.ISAAC.2021.64

Abstract

We revisit the fundamental problem of compressing an integer dictionary that supports efficient rank and select operations by exploiting two kinds of regularities arising in real data: repetitiveness and approximate linearity. Our first contribution is a Lempel-Ziv parsing properly enriched to also capture approximate linearity in the data and still be compressed to the kth order entropy. Our second contribution is a variant of the block tree structure whose space complexity takes advantage of both repetitiveness and approximate linearity, and results highly competitive in time too. Our third and final contribution is an implementation and experimentation of this last data structure, which achieves new space-time trade-offs compared to known data structures that exploit only one of the two regularities.

Subject Classification

ACM Subject Classification
  • Theory of computation → Data compression
  • Theory of computation → Data structures design and analysis
Keywords
  • Data compression
  • Compressed data structures
  • Entropy

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Rachit Agarwal, Anurag Khandelwal, and Ion Stoica. Succinct: enabling queries on compressed data. In Proc. 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2015. Google Scholar
  2. Diego Arroyuelo and Rajeev Raman. Adaptive succinctness. In Proc. 26th International Symposium on String Processing and Information Retrieval (SPIRE), 2019. Google Scholar
  3. Jeremy Barbay and Gonzalo Navarro. Compressed representations of permutations, and applications. In Proc. 26th International Symposium on Theoretical Aspects of Computer Science (STACS), 2009. Google Scholar
  4. Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. Theory and practice of monotone minimal perfect hashing. ACM Journal of Experimental Algorithmics, 16, 2008. Google Scholar
  5. Djamal Belazzougui, Manuel Cáceres, Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Gonzalo Navarro, Alberto Ordóñez, Simon J. Puglisi, and Yasuo Tabei. Block trees. Journal of Computer and System Sciences, 117:1-22, 2021. Google Scholar
  6. Djamal Belazzougui, Patrick Hagge Cording, Simon J. Puglisi, and Yasuo Tabei. Access, rank, and select in grammar-compressed strings. In Proc. 23rd Annual European Symposium on Algorithms (ESA), pages 142-154, 2015. Google Scholar
  7. Antonio Boffa, Paolo Ferragina, and Giorgio Vinciguerra. A "learned" approach to quicken and compress rank/select dictionaries. In Proc. 23rd SIAM Symposium on Algorithm Engineering and Experiments (ALENEX), pages 46-59, 2021. Google Scholar
  8. David Clark. Compact Pat Trees. PhD thesis, University of Waterloo, Canada, 1996. Google Scholar
  9. Peter Elias. Efficient storage and retrieval by content and address of static files. Journal of the ACM, 21(2):246-260, 1974. Google Scholar
  10. Robert Mario Fano. On the number of bits required to implement an associative memory. Memo 61. Massachusetts Institute of Technology, Project MAC, 1971. Google Scholar
  11. Andrea Farruggia, Paolo Ferragina, Antonio Frangioni, and Rossano Venturini. Bicriteria data compression. SIAM Journal on Computing, 48(5):1603-1642, 2019. Google Scholar
  12. Paolo Ferragina, Raffaele Giancarlo, and Giovanni Manzini. The myriad virtues of wavelet trees. Information and Computation, 207(8):849-866, 2009. Google Scholar
  13. Paolo Ferragina, Stefan Kurtz, Stefano Lonardi, and Giovanni Manzini. Computational biology. In Dinesh P. Mehta and Sartaj Sahni, editors, Handbook of Data Structures and Applications, chapter 59. CRC Press, 2nd edition, 2018. Google Scholar
  14. Paolo Ferragina, Fabrizio Lillo, and Giorgio Vinciguerra. On the performance of learned data structures. Theoretical Computer Science, 871:107-120, 2021. Google Scholar
  15. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552-581, 2005. Google Scholar
  16. Paolo Ferragina and Rossano Venturini. A simple storage scheme for strings achieving entropy bounds. Theoretical Computer Science, 372(1):115-121, 2007. Google Scholar
  17. Paolo Ferragina and Giorgio Vinciguerra. The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds. PVLDB, 13(8):1162-1175, 2020. Google Scholar
  18. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. Journal of the ACM, 67(1), 2020. Google Scholar
  19. Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From theory to practice: plug and play with succinct data structures. In Proc. 13th International Symposium on Experimental Algorithms (SEA), pages 326-337, 2014. Google Scholar
  20. Simon Gog, Juha Kärkkäinen, Dominik Kempa, Matthias Petri, and Simon J. Puglisi. Fixed block compression boosting in FM-indexes: Theory and practice. Algorithmica, 81(4):1370-1391, 2019. Google Scholar
  21. Alexander Golynski, Alessio Orlandi, Rajeev Raman, and S. Srinivasa Rao. Optimal indexes for sparse bit vectors. Algorithmica, 69(4):906-924, 2014. Google Scholar
  22. Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35(2):378-407, 2005. Google Scholar
  23. John L. Hennessy and David A. Patterson. A new golden age for computer architecture. Communications of the ACM, 62(2):48-60, 2019. Google Scholar
  24. Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Hybrid compression of bitvectors for the FM-index. In Proc. 24th Data Compression Conference (DCC), pages 302-311, 2014. Google Scholar
  25. Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Towards a definitive measure of repetitiveness. In Proc. 14th Latin American Symposium on Theoretical Informatics (LATIN), 2020. Google Scholar
  26. Sambasiva Rao Kosaraju and Giovanni Manzini. Compression of low entropy strings with Lempel-Ziv algorithms. SIAM Journal on Computing, 29(3):893-911, 1999. Google Scholar
  27. Sebastian Kreft and Gonzalo Navarro. On compressing and indexing repetitive sequences. Theoretical Computer Science, 483:115-133, 2013. Google Scholar
  28. Abraham Lempel and Jacob Ziv. On the complexity of finite sequences. IEEE Transactions on Information Theory, 22(1):75-81, 1976. Google Scholar
  29. Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. Genome-Scale Algorithm Design. Cambridge University Press, 2015. Google Scholar
  30. Veli Mäkinen and Gonzalo Navarro. Rank and select revisited and extended. Theoretical Computer Science, 387(3):332-347, 2007. Google Scholar
  31. J. Ian Munro. Tables. In Proc. 16th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS), 1996. Google Scholar
  32. J. Ian Munro and Venkatesh Raman. Succinct representation of balanced parentheses, static trees and planar graphs. In Proc. 38th Annual Symposium on Foundations of Computer Science (FOCS), 1997. Google Scholar
  33. Gonzalo Navarro. Spaces, trees, and colors: the algorithmic landscape of document retrieval on sequences. ACM Computing Surveys, 46(4), 2014. Google Scholar
  34. Gonzalo Navarro. Compact data structures: a practical approach. Cambridge University Press, 2016. Google Scholar
  35. Gonzalo Navarro. Indexing highly repetitive string collections, part I: Repetitiveness measures. ACM Computing Surveys, 54(2), 2020. Google Scholar
  36. Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1), 2007. Google Scholar
  37. Daisuke Okanohara and Kunihiko Sadakane. Practical entropy-compressed rank/select dictionary. In Proc. 9th Workshop on Algorithm Engineering and Experiments (ALENEX), 2007. Google Scholar
  38. Joseph O'Rourke. An on-line algorithm for fitting straight lines between data ranges. Communications of the ACM, 24(9):574-578, 1981. Google Scholar
  39. Giuseppe Ottaviano, Nicola Tonellotto, and Rossano Venturini. Optimal space-time tradeoffs for inverted indexes. In Proc. 8th ACM International Conference on Web Search and Data Mining (WSDM), 2015. Google Scholar
  40. Mihai Pǎtraşcu. Succincter. In Proc. 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2008. Google Scholar
  41. Rajeev Raman, Venkatesh Raman, and Srinivasa Rao Satti. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Transactions on Algorithms, 3(4), 2007. Google Scholar
  42. Sofya Raskhodnikova, Dana Ron, Ronitt Rubinfeld, and Adam Smith. Sublinear algorithms for approximating string compressibility. Algorithmica, 65(3):685-709, 2013. Google Scholar
  43. Kunihiko Sadakane and Roberto Grossi. Squeezing succinct data structures into entropy bounds. In Proc. 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1230-1239, 2006. Google Scholar
  44. Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337-343, 1977. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail