Entropy Lower Bounds for Dictionary Compression

Gańczorz, Michał

doi:10.4230/LIPIcs.CPM.2019.11

Abstract

We show that a wide class of dictionary compression methods (including LZ77, LZ78, grammar compressors, and parsing-based structures) require |S|H_k(S) + Omega(|S| k log sigma / log_sigma |S|) bits to encode their output. This matches known upper bounds and improves on the information-theoretic lower bound of |S|H_k(S). To this end, we abstract the crucial properties of the parsings created by those methods, construct a certain family of strings, and analyze the parsings of those strings. We also show that for k = alpha log_sigma |S|, where 0 < alpha < 1 is a constant, the aforementioned methods produce an output of size at least 1/(1-alpha) |S|H_k(S) bits. Thus our results separate dictionary compressors from context-based ones (such as PPM) and BWT-based ones, as those include methods achieving |S|H_k(S) + O(sigma^k log sigma) bits, i.e., the redundancy depends on k and sigma but not on |S|.
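
To make the quantities in the abstract concrete, the following Python sketch evaluates |S|H_k(S) and the expression inside the Omega(...) term for a given string. It is an illustration only, not code from the paper. It assumes the standard definition of k-th order empirical entropy (each length-k context contributes the zeroth-order cost of the symbols that follow it), measures everything in bits, and ignores the constant hidden by Omega. The function names `h0_bits`, `total_hk_bits`, and `redundancy_term` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def h0_bits(counts):
    """Return n * H_0 in bits for a multiset of symbol counts."""
    n = sum(counts.values())
    return sum(c * math.log2(n / c) for c in counts.values())


def total_hk_bits(s, k):
    """Return |S| * H_k(S) in bits (assumed: standard empirical entropy)."""
    if k == 0:
        return h0_bits(Counter(s))
    # For every length-k context, count the symbols that follow it in s.
    followers = defaultdict(Counter)
    for i in range(k, len(s)):
        followers[s[i - k:i]][s[i]] += 1
    return sum(h0_bits(c) for c in followers.values())


def redundancy_term(n, k, sigma):
    """Return |S| k log sigma / log_sigma |S|, with the Omega constant ignored."""
    log_sigma_n = math.log(n) / math.log(sigma)
    return n * k * math.log2(sigma) / log_sigma_n
```

For example, `total_hk_bits("abababab", 1)` is 0 because every context determines its successor, while `redundancy_term(16, 1, 2)` evaluates to 4. Under these assumptions, a string can have very low |S|H_k(S) while the second term stays positive, and that gap is what the lower bound concerns.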
