Using Statistical Encoding to Achieve Tree Succinctness Never Seen Before

Gańczorz, Michał

doi:10.4230/LIPIcs.STACS.2020.22

Abstract

We propose new entropy measures for trees. The known measures are H_k(𝒯), the k-th order (tree label) entropy (Ferragina et al. 2005), and the tree entropy H(𝒯) (Jansson et al. 2006); the former considers only the tree labels and the latter only the tree shape. The proposed measures, H_k(𝒯|L) and H_k(L|𝒯), exploit the relation between the labels and the tree shape. We prove that they lower bound the tree entropy and the label entropy, respectively, i.e. H_k(𝒯|L) ≤ H(𝒯) and H_k(L|𝒯) ≤ H_k(L). Besides being theoretically superior, the new measures are significantly smaller in practice.
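
The two inequalities can be checked numerically on a toy tree. The following is a minimal sketch, not the paper's method: the abstract does not define the new measures, so the sketch assumes that for k = 0 they behave like the empirical entropy of labels given node degree and of node degrees given label. The helper names `entropy` and `conditional_entropy` and the example tree are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(symbols):
    """Empirical zeroth-order entropy of a sequence, in bits per symbol."""
    n = len(symbols)
    return sum(c / n * log2(n / c) for c in Counter(symbols).values())


def conditional_entropy(symbols, contexts):
    """Empirical entropy of symbols given their contexts, in bits per symbol."""
    n = len(symbols)
    groups = {}
    for s, c in zip(symbols, contexts):
        groups.setdefault(c, []).append(s)
    # Weighted average of the entropy inside each context group.
    return sum(len(g) / n * entropy(g) for g in groups.values())


# Hypothetical toy tree, one (label, degree) pair per node in preorder.
# Assumption: degree stands in for "tree shape" and k = 0 (no ancestor context).
nodes = [("a", 2), ("b", 0), ("a", 2), ("b", 0), ("c", 0)]
labels = [label for label, _ in nodes]
degrees = [degree for _, degree in nodes]

print("H(labels)           =", entropy(labels))
print("H(labels | degrees) =", conditional_entropy(labels, degrees))
print("H(degrees)          =", entropy(degrees))
print("H(degrees | labels) =", conditional_entropy(degrees, labels))
```

In this sketch conditioning can never increase the empirical entropy, which mirrors the direction of the inequalities stated above. The measures in the paper additionally use a k-th order context, which is not modeled here.
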
We also propose a new succinct representation of labeled trees, which represents a tree T within one of the following space bounds: |T|(H(𝒯) + H_k(L|𝒯)) or |T|(H_k(𝒯|L) + H_k(L)). The representation is based on a new, simple method of partitioning the tree, which preserves both the tree shape and the node degrees. The previous state-of-the-art compression method achieved |T|(H(𝒯) + H_k(L)) bits by combining the results of Ferragina et al. 2005 and Jansson et al. 2006, so the proposed representation is never worse and often superior. Moreover, our representation supports standard tree navigation in constant time, as well as more complex queries. No structure achieving these space bounds was known before: the aforementioned solution supported compression only, whereas our structure is the first to achieve H_k(𝒯) for k > 0 while supporting such queries. Lastly, our data structure is fairly simple, both conceptually and in terms of implementation, and it uses known tools, which is a counter-argument to the claim that methods based on tree partitioning are impractical.

References

Janos Aczél. On Shannon’s inequality, optimal coding, and characterizations of Shannon’s and Rényi’s entropies. In Symposia Mathematica, volume 15, pages 153-179, 1973.
Diego Arroyuelo, Rodrigo Cánovas, Gonzalo Navarro, and Kunihiko Sadakane. Succinct trees in practice. In 2010 Proceedings of the Twelfth Workshop on Algorithm Engineering and Experiments (ALENEX), pages 84-97. SIAM, 2010.
Jérémy Barbay, Meng He, J. Ian Munro, and S. Srinivasa Rao. Succinct indexes for strings, binary relations and multi-labeled trees. In Nikhil Bansal, Kirk Pruhs, and Clifford Stein, editors, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7-9, 2007, pages 680-689. SIAM, 2007. URL: http://dl.acm.org/citation.cfm?id=1283383.1283456.
Djamal Belazzougui and Gonzalo Navarro. Optimal lower and upper bounds for representing sequences. ACM Trans. Algorithms, 11(4):31:1-31:21, April 2015. URL: https://doi.org/10.1145/2629339.
Michael A Bender and Martín Farach-Colton. The level ancestor problem simplified. Theoretical Computer Science, 321(1):5-12, 2004.
David Benoit, Erik D Demaine, J Ian Munro, Rajeev Raman, Venkatesh Raman, and S Srinivasa Rao. Representing trees of higher degree. Algorithmica, 43(4):275-292, 2005.
Philip Bille, Anders Roy Christiansen, Nicola Prezza, and Frederik Rye Skjoldjensen. Succinct partial sums and Fenwick trees. In Gabriele Fici, Marinella Sciortino, and Rossano Venturini, editors, String Processing and Information Retrieval - 24th International Symposium, SPIRE 2017, Palermo, Italy, September 26-29, 2017, Proceedings, volume 10508 of Lecture Notes in Computer Science, pages 91-96. Springer, 2017. URL: https://doi.org/10.1007/978-3-319-67428-5_8.
Philip Bille, Inge Li Gørtz, Gad M Landau, and Oren Weimann. Tree compression with top trees. In International Colloquium on Automata, Languages, and Programming, pages 160-171. Springer, 2013.
Giorgio Busatto, Markus Lohrey, and Sebastian Maneth. Efficient memory representation of XML document trees. Information Systems, 33(4-5):456-474, 2008.
R. D. Cameron. Source encoding using syntactic information source models. IEEE Transactions on Information Theory, 34(4):843-850, July 1988. URL: https://doi.org/10.1109/18.9782.
Arash Farzan and J Ian Munro. A uniform paradigm to succinctly encode various families of trees. Algorithmica, 68(1):16-40, 2014.
Arash Farzan, Rajeev Raman, and S Srinivasa Rao. Universal succinct representations of trees? In International Colloquium on Automata, Languages, and Programming, pages 451-462. Springer, 2009.
Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S Muthukrishnan. Structuring labeled trees for optimal succinctness, and beyond. In Foundations of Computer Science, 2005. FOCS 2005. 46th Annual IEEE Symposium on, pages 184-193. IEEE, 2005.
Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S Muthukrishnan. Compressing and indexing labeled trees, with applications. Journal of the ACM (JACM), 57(1):4, 2009.
Paolo Ferragina and Giovanni Manzini. Compression boosting in optimal linear time using the Burrows-Wheeler transform. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '04, pages 655-663, Philadelphia, PA, USA, 2004. Society for Industrial and Applied Mathematics. URL: http://dl.acm.org/citation.cfm?id=982792.982892.
Paolo Ferragina and Rossano Venturini. A simple storage scheme for strings achieving entropy bounds. Theor. Comput. Sci., 372(1):115-121, 2007. URL: https://doi.org/10.1016/j.tcs.2006.12.012.
Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Optimal-time text indexing in BWT-runs bounded space. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '18, pages 1459-1477, Philadelphia, PA, USA, 2018. Society for Industrial and Applied Mathematics. URL: http://dl.acm.org/citation.cfm?id=3174304.3175401.
Moses Ganardi, Danny Hucke, Artur Jeż, Markus Lohrey, and Eric Noeth. Constructing small tree grammars and small circuits for formulas. Journal of Computer and System Sciences, 86:136-158, 2017.
Moses Ganardi, Danny Hucke, Markus Lohrey, and Eric Noeth. Tree compression using string grammars. Algorithmica, 80(3):885-917, 2018. URL: https://doi.org/10.1007/s00453-017-0279-3.
Michał Gańczorz. Entropy lower bounds for dictionary compression. In 30th Annual Symposium on Combinatorial Pattern Matching, CPM 2019, June 18-20, 2019, Pisa, Italy., pages 11:1-11:18, 2019. URL: https://doi.org/10.4230/LIPIcs.CPM.2019.11.
Paweł Gawrychowski and Artur Jeż. LZ77 factorisation of trees. In Akash Lal, S. Akshay, Saket Saurabh, and Sandeep Sen, editors, 36th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, FSTTCS 2016, December 13-15, 2016, Chennai, India, volume 65 of LIPIcs, pages 35:1-35:15. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2016. URL: https://doi.org/10.4230/LIPIcs.FSTTCS.2016.35.
Richard F Geary, Rajeev Raman, and Venkatesh Raman. Succinct ordinal trees with level-ancestor queries. ACM Transactions on Algorithms (TALG), 2(4):510-534, 2006.
Rodrigo González and Gonzalo Navarro. Statistical encoding of succinct data structures. In Moshe Lewenstein and Gabriel Valiente, editors, Combinatorial Pattern Matching, 17th Annual Symposium, CPM 2006, Barcelona, Spain, July 5-7, 2006, Proceedings, volume 4009 of Lecture Notes in Computer Science, pages 294-305. Springer, 2006. URL: https://doi.org/10.1007/11780441_27.
Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 841-850. Society for Industrial and Applied Mathematics, 2003.
Roberto Grossi, Alessio Orlandi, and Rajeev Raman. Optimal trade-offs for succinct string indexes. In Samson Abramsky, Cyril Gavoille, Claude Kirchner, Friedhelm Meyer auf der Heide, and Paul G. Spirakis, editors, Automata, Languages and Programming, 37th International Colloquium, ICALP 2010, Bordeaux, France, July 6-10, 2010, Proceedings, Part I, volume 6198 of Lecture Notes in Computer Science, pages 678-689. Springer, 2010. URL: https://doi.org/10.1007/978-3-642-14165-2_57.
Roberto Grossi, Rajeev Raman, Srinivasa Rao Satti, and Rossano Venturini. Dynamic compressed strings with random access. In Automata, Languages, and Programming - 40th International Colloquium, ICALP 2013, Riga, Latvia, July 8-12, 2013, Proceedings, Part I, pages 504-515. Springer, 2013. URL: https://doi.org/10.1007/978-3-642-39206-1_43.
Meng He, J Ian Munro, and S Srinivasa Rao. Succinct ordinal trees based on tree covering. In International Colloquium on Automata, Languages, and Programming, pages 509-520. Springer, 2007.
Meng He, J Ian Munro, and Gelin Zhou. A framework for succinct labeled ordinal trees over large alphabets. Algorithmica, 70(4):696-717, 2014.
Lorenz Hübschle-Schneider and Rajeev Raman. Tree compression with top trees revisited. In International Symposium on Experimental Algorithms, pages 15-27. Springer, 2015.
D. Hucke, M. Lohrey, and L. S. Benkner. Entropy bounds for grammar-based tree compressors. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 1687-1691, July 2019. URL: https://doi.org/10.1109/ISIT.2019.8849372.
Guy Jacobson. Space-efficient static trees and graphs. In Foundations of Computer Science, 1989., 30th Annual Symposium on, pages 549-554. IEEE, 1989.
Guy Joseph Jacobson. Succinct Static Data Structures. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 1988. AAI8918056.
Jesper Jansson, Kunihiko Sadakane, and Wing-Kin Sung. Ultra-succinct representation of ordered trees. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 575-584. Society for Industrial and Applied Mathematics, 2007.
Artur Jeż and Markus Lohrey. Approximation of smallest linear tree grammar. Inf. Comput., 251:215-251, 2016. URL: https://doi.org/10.1016/j.ic.2016.09.007.
Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: String attractors. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, pages 827-840, New York, NY, USA, 2018. ACM. URL: https://doi.org/10.1145/3188745.3188814.
Sebastian Kreft and Gonzalo Navarro. Self-indexing based on LZ77. In Annual Symposium on Combinatorial Pattern Matching, pages 41-54. Springer, 2011.
Hsueh-I Lu and Chia-Chi Yeh. Balanced parentheses strike back. ACM Transactions on Algorithms (TALG), 4(3):28, 2008.
J. Ian Munro and Yakov Nekrich. Compressed data structures for dynamic sequences. In Algorithms - ESA 2015 - 23rd Annual European Symposium, Patras, Greece, September 14-16, 2015, Proceedings, pages 891-902, 2015. URL: https://doi.org/10.1007/978-3-662-48350-3_74.
J Ian Munro and Venkatesh Raman. Succinct representation of balanced parentheses, static trees and planar graphs. In Foundations of Computer Science, 1997. Proceedings., 38th Annual Symposium on, pages 118-126. IEEE, 1997.
Gonzalo Navarro and Kunihiko Sadakane. Fully functional static and dynamic succinct trees. ACM Transactions on Algorithms (TALG), 10(3):16, 2014.
Mihai Pătrașcu. Succincter. In FOCS'08. IEEE 49th Annual IEEE Symposium on Foundations of Computer Science, 2008., pages 305-313. IEEE, 2008.
Rajeev Raman, Venkatesh Raman, and Srinivasa Rao Satti. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Transactions on Algorithms (TALG), 3(4):43, 2007.
Rajeev Raman and Satti Srinivasa Rao. Succinct dynamic dictionaries and trees. In International Colloquium on Automata, Languages, and Programming, pages 357-368. Springer, 2003.
Dekel Tsur. Succinct representation of labeled trees. Theoretical Computer Science, 562:320-329, 2015.
