Abstract
Trees are a ubiquitous data structure in computer science. LISP, for instance, was designed to manipulate nested lists, that is, ordered unranked trees. Already at that time, DAGs were used to detect common subexpression, a process known as "hash consing." In a DAG every distinct subtree is represented only once (but can be referenced many times) and hence it constitutes a dictionarybased compression method for ordered trees. In our compression scenario we distinguish two kinds of ordered trees: binary and unranked. The latter appear naturally as representation of XML document structures. We survey these dictionarybased compression methods for ordered trees:
(1) DAGs,
(2) hybrid DAGs,
(3) straightline contextfree tree grammars ("SLT grammars").
We compare the minimal DAG of an unranked tree with the minimal DAG of its binary tree encoding. The latter is obtained by identifying first children of the unranked tree with left children of the binary tree, and nextsiblings with the right children. For XML document trees, unranked DAGs are usually smaller than encoded binary DAGs. We show that this holds for arbitrary unranked trees, on average. We also present the "hybrid DAG"; its size lowerbounds those of the binary and unranked DAGs.
Finding a smallest SLT grammar for a given tree is NPcomplete. We discuss two lineartime approximation algorithms: BPLEX and TreeRePair. For typical XML document trees, TreeRePair produces SLT grammars that are only one fourth of the size of the minimal DAG, and which contain approximately 3$% of the edges of the original tree. As far as we know, this gives rise to the smallest existing pointerbased tree representation.
We show that some basic algorithms can be computed directly on the compressed trees, without prior decompression. Examples include the execution of different kinds of tree automata, and the realtime traversal of the original tree. It is even possible to evaluate simple XPath queries directly on the SLT grammars, using deterministic nodeselecting tree automata. In this way, impressive speedups are achieved over existing XPath evaluators, while at the same time the memory requirement is slashed to only a few percent. For more complex XPath queries that require nondeterministic nodeselecting tree automata, efficient evaluation over SLT grammars remains a difficult challenge.
BibTeX  Entry
@InProceedings{maneth:LIPIcs:2012:3480,
author = {Sebastian Maneth},
title = {{DictionaryBased Tree Compression (Invited Talk)}},
booktitle = {23rd International Conference on Rewriting Techniques and Applications (RTA'12) },
pages = {55},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {9783939897385},
ISSN = {18688969},
year = {2012},
volume = {15},
editor = {Ashish Tiwari},
publisher = {Schloss DagstuhlLeibnizZentrum fuer Informatik},
address = {Dagstuhl, Germany},
URL = {http://drops.dagstuhl.de/opus/volltexte/2012/3480},
URN = {urn:nbn:de:0030drops34802},
doi = {http://dx.doi.org/10.4230/LIPIcs.RTA.2012.5},
annote = {Keywords: Tree grammars, tree automata, straightline programs}
}
Keywords: 

Tree grammars, tree automata, straightline programs 
Seminar: 

23rd International Conference on Rewriting Techniques and Applications (RTA'12) 
Issue Date: 

2012 
Date of publication: 

15.05.2012 