eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2008-11-20
8261
1
9
10.4230/DagSemProc.08261.1
article
08261 Abstracts Collection – Structure-Based Compression of Complex Massive Data
Böttcher, Stefan
Lohrey, Markus
Maneth, Sebastian
Rytter, Wojciech
From June 22, 2008 to June 27, 2008 the Dagstuhl Seminar 08261 ``Structure-Based Compression of Complex Massive Data'' was held in the International Conference and Research Center (IBFI), Schloss Dagstuhl.
During the seminar, several participants presented their current
research, and ongoing work and open problems were discussed. Abstracts of
the presentations given during the seminar as well as abstracts of
seminar results and ideas are put together in this paper. The first section
describes the seminar topics and goals in general.
Links to extended abstracts or full papers are provided, if available.
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol08261/DagSemProc.08261.1/DagSemProc.08261.1.pdf
Data compression
algorithms for compressed strings and trees
XML-compression
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2008-11-20
8261
1
4
10.4230/DagSemProc.08261.2
article
08261 Executive Summary – Structure-Based Compression of Complex Massive Data
Böttcher, Stefan
Lohrey, Markus
Maneth, Sebastian
Rytter, Wojciech
From June 22 to June 27, 2008, the Dagstuhl Seminar
``08261 Structure-Based Compression of
Complex Massive Data'' took place at the
Conference and Research Center (IBFI) in Dagstuhl.
22 researchers with interests in theory and application
of compression and computation on compressed structures
met to present their current work and to discuss
future directions.
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol08261/DagSemProc.08261.2/DagSemProc.08261.2.pdf
Compression
Succinct Data Structure
Pattern Matching
Text Search
XML Query
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2008-11-20
8261
1
16
10.4230/DagSemProc.08261.3
article
A Rewrite Approach for Pattern Containment – Application to Query Evaluation on Compressed Documents
Fila-Kordy, Barbara
In this paper we introduce an approach for handling the containment problem for the fragment XP(/,//,[ ],*) of XPath.
Using rewriting techniques we define a necessary and sufficient condition for pattern containment. This rewrite view is then adapted to query
evaluation on XML documents, and remains valid even if the documents
are given in a compressed form, as dags.
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol08261/DagSemProc.08261.3/DagSemProc.08261.3.pdf
Pattern Containment
Compressed Documents
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2008-11-20
8261
1
14
10.4230/DagSemProc.08261.4
article
A Space-Saving Approximation Algorithm for Grammar-Based Compression
Sakamoto, Hiroshi
A space-efficient approximation algorithm is presented for the grammar-based compression
problem, which asks, for a given string, for a smallest
context-free grammar (CFG) deriving the string. For input
length n and optimum CFG size g, the algorithm uses only
O(g log g) space and O(n log* n) time to achieve an O((log* n) log n) approximation
ratio to the optimum compression, where log* n is the maximum
number of logarithms satisfying log log ··· log n > 1. This ratio is thus
regarded as almost O(log n), which is the currently best approximation
ratio. While g depends on the string, it is known that g = Ω(log n) and
g = O(n / log_k n) for strings over a k-letter alphabet [12].
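The iterated-logarithm quantity defined in the abstract above (the maximum number of nested logarithms whose result stays above 1) can be computed directly. The following is a minimal sketch, assuming base-2 logarithms, which the abstract does not state; the function name `log_star` is hypothetical. Note that this follows the abstract's wording, which yields a value one smaller than the more common "number of logarithms needed to reach at most 1" convention.

```python
import math


def log_star(n):
    """Maximum k such that applying log2 k times to n still gives a value > 1.

    Assumption: base-2 logarithms (the abstract does not fix the base).
    """
    count = 0
    value = n
    while value > 1:
        value = math.log2(value)
        if value > 1:
            count += 1
    return count


if __name__ == "__main__":
    for n in (2, 16, 65536, 2 ** 65536):
        print(n.bit_length(), "bit input ->", log_star(n))
```

The value grows so slowly that it stays in the single digits for any input that fits in memory, which is why the stated ratio is regarded as almost O(log n).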
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol08261/DagSemProc.08261.4/DagSemProc.08261.4.pdf
Grammar based compression
space efficient approximation
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2008-11-20
8261
1
0
10.4230/DagSemProc.08261.5
article
An Efficient Algorithm to Test Square-Freeness of Strings Compressed by Balanced Straight Line Program
Matsubara, Wataru
Inenaga, Shunsuke
Shinohara, Ayumi
In this paper we study the problem of deciding whether a
given compressed string contains a square. A string x is called a square
if x = zz for some string z such that z = u^k implies k = 1 and u = z. A string w is said to be
square-free if no substring of w is a square. Many efficient algorithms
that test whether a given string is square-free have been developed so far. However,
very little is known about testing square-freeness of a given compressed
string. In this paper, we give an O(max(n^2, n log^2 N))-time, O(n^2)-space
solution to test square-freeness of a given compressed string, where n
and N are the sizes of the given compressed string and the corresponding
decompressed string, respectively. Our input strings are compressed by a
balanced straight-line program (BSLP). We remark that a BSLP can achieve exponential
compression, that is, N = O(2^n). Hence no decompress-then-test
approach can be better than our method in the worst case.
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol08261/DagSemProc.08261.5/DagSemProc.08261.5.pdf
Square Freeness
Straight Line Program
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2008-11-20
8261
1
17
10.4230/DagSemProc.08261.6
article
An In-Memory XQuery/XPath Engine over a Compressed Structured Text Representation
Bonifati, Angela
Leighton, Gregory
Mäkinen, Veli
Maneth, Sebastian
Navarro, Gonzalo
Pugliese, Andrea
We describe the architecture and main algorithmic design decisions for an XQuery/XPath processing engine over XML collections which will be represented using a self-indexing approach, that is, a compressed representation that will allow for basic searching and navigational operations in compressed form. The goal is a structure that occupies little space and thus permits
manipulating large collections in main memory.
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol08261/DagSemProc.08261.6/DagSemProc.08261.6.pdf
Compressed self-index
compressed XML representation
XPath
XQuery
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2008-11-20
8261
1
10
10.4230/DagSemProc.08261.7
article
Clone Detection via Structural Abstraction
Evans, William S.
Fraser, Christoph W.
Ma, Fei
This paper describes the design, implementation, and
application of a new algorithm to detect cloned code. It
operates on the abstract syntax trees formed by many compilers
as an intermediate representation. It extends prior
work by identifying clones even when arbitrary subtrees
have been changed. On a 440,000-line code corpus, 20-50% of
the clones it detected were missed by previous methods.
The method also identifies cloning in declarations, so
it is somewhat more general than conventional procedural
abstraction.
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol08261/DagSemProc.08261.7/DagSemProc.08261.7.pdf
Clone Detection
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2008-11-20
8261
1
9
10.4230/DagSemProc.08261.8
article
Compression vs Queryability - A Case Study
Anantharaman, Siva
Some compromise on compression is known to be necessary, if the relative
positions of the information stored by semi-structured documents
are to remain accessible under queries. With this in view, we compare,
on an example, the ‘query-friendliness’ of XML documents, when
compressed into straight-line tree grammars which are either regular or
context-free. The queries considered are in a limited fragment of XPath,
corresponding to a type of patterns; each such query defines naturally a
non-deterministic, bottom-up ‘query automaton’ that runs just as well on
a tree as on its compressed dag.
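The dag form of a tree mentioned above can be obtained by sharing identical subtrees. The following is a minimal sketch of that standard idea, not code from the paper; the tree encoding (a pair of label and list of children) and the names `to_dag` and `count_nodes` are assumptions for illustration.

```python
def to_dag(tree, table=None):
    """Return (node_id, table), where `table` maps (label, child_ids) to an id.

    Identical subtrees receive the same id, so each distinct subtree is
    stored only once.
    """
    if table is None:
        table = {}
    label, children = tree
    child_ids = tuple(to_dag(child, table)[0] for child in children)
    key = (label, child_ids)
    if key not in table:
        table[key] = len(table)
    return table[key], table


def count_nodes(tree):
    """Number of nodes in the uncompressed tree."""
    _, children = tree
    return 1 + sum(count_nodes(child) for child in children)


# Hypothetical document with two structurally identical <item> subtrees.
item = ("item", [("name", []), ("price", [])])
doc = ("doc", [item, item])

if __name__ == "__main__":
    root_id, table = to_dag(doc)
    print("tree nodes:", count_nodes(doc), "dag nodes:", len(table))
```

Because every dag node stands for one distinct subtree, a bottom-up automaton needs to compute its state only once per dag node, which is one way to read the remark that such an automaton runs on the dag as well as on the tree.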
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol08261/DagSemProc.08261.8/DagSemProc.08261.8.pdf
Tree automata
Tree Grammars
Dags
XML documents
Queries
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2008-11-20
8261
1
12
10.4230/DagSemProc.08261.9
article
Optimizing XML Compression in XQueC
Arion, Andrei
Bonifati, Angela
Manolescu, Ioana
Pugliese, Andrea
We present our approach to the problem of optimizing compression choices in the context of the XQueC compressed XML database system. In XQueC, data items are aggregated into containers, which are further grouped to be compressed together. This way, XQueC is able to exploit data commonalities and to perform query evaluation in the compressed domain, with the aim of improving both compression and querying performance. However, different compression
algorithms have different performance and support different sets of operations in the compressed domain. Therefore, choosing how to group containers and which compression algorithm to apply to each group is a challenging issue. We address this problem through an appropriate cost model and a suitable blend of heuristics which, based on a given query workload, are capable of driving
appropriate compression choices.
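The container idea can be illustrated with a small sketch. It is not XQueC's implementation: it assumes that a container collects the text values found under one root-to-leaf path, and it uses `zlib` as a stand-in compressor. The function names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
import zlib
from collections import defaultdict


def build_containers(xml_text):
    """Group text values by root-to-leaf path (assumed container definition)."""
    containers = defaultdict(list)

    def walk(node, path):
        path = path + "/" + node.tag
        text = (node.text or "").strip()
        if text:
            containers[path].append(text)
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return dict(containers)


def compress_container(values):
    """Compress all values of one container together (zlib as a stand-in)."""
    return zlib.compress("\n".join(values).encode("utf-8"))


SAMPLE = (
    "<shop>"
    "<item><name>pen</name><price>2</price></item>"
    "<item><name>ink</name><price>5</price></item>"
    "</shop>"
)

if __name__ == "__main__":
    for path, values in build_containers(SAMPLE).items():
        print(path, values, len(compress_container(values)), "bytes")
```

Values that share a path tend to resemble each other, which is the kind of commonality that grouping is meant to exploit; which compressor suits which group is the choice the abstract assigns to the cost model.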
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol08261/DagSemProc.08261.9/DagSemProc.08261.9.pdf
XML compression
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2008-11-20
8261
1
14
10.4230/DagSemProc.08261.10
article
Storage and Retrieval of Individual Genomes
Mäkinen, Veli
Navarro, Gonzalo
Sirén, Jouni
Välimäki, Niko
A repetitive sequence collection is one where portions of a \emph{base sequence} of length $n$ are repeated many times with small variations, forming a collection of total length $N$. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, a suffix tree occupies $O(N \log N)$ bits, which very soon inhibits
in-memory analyses. Recent advances in full-text \emph{self-indexing} reduce the space of the suffix tree to $O(N \log \sigma)$ bits, where $\sigma$ is the alphabet size. In practice, the space reduction is more than $10$-fold, for example on the suffix tree of the human genome. However, this reduction remains a constant factor when more sequences are added to the collection.
We develop a new self-index suited for the repetitive sequence collection setting. Its expected space requirement depends only on the length $n$ of the base sequence and the number $s$ of variations in its repeated copies. That is, the space reduction is no longer constant, but depends on $N/n$.
We believe the structure developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in the sequencing technologies.
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol08261/DagSemProc.08261.10/DagSemProc.08261.10.pdf
Pattern matching
text indexing
compressed data structures
comparative genomics
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2008-11-20
8261
1
27
10.4230/DagSemProc.08261.11
article
SXSAQCT and XSAQCT: XML Queryable Compressors
Müldner, Tomasz
Fry, Christopher
Miziolek, Jan Krzysztof
Durno, Scott
Recently, there has been growing interest in queryable XML compressors, which can be used to query compressed data with minimal decompression, or even without any decompression. At the same time, very few such projects have been made available for testing and comparison. In this paper, we report our current work on two novel queryable XML compressors: a schema-based compressor, SXSAQCT, and a schema-free compressor, XSAQCT. While the work on both compressors is at an early stage, our experiments (reported here) show that our approach may compete successfully with other known queryable compressors.
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol08261/DagSemProc.08261.11/DagSemProc.08261.11.pdf
XML compression
queryable
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2008-11-20
8261
1
16
10.4230/DagSemProc.08261.12
article
The XQueC Project: Compressing and Querying XML
Arion, Andrei
Bonifati, Angela
Manolescu, Ioana
Pugliese, Andrea
We outline in this paper the main contributions of the XQueC project. XQueC
(XQuery processor and Compressor) is the first compression tool to seamlessly allow XQuery queries in the compressed domain. It includes a set of data structures that shred the XML document into suitable chunks linked to each other, thus departing from the 'homomorphic' principle adopted in previous XML compressors. According to this principle, the compressed document is homomorphic to the original document. Moreover, in order to avoid the time spent compressing and decompressing intermediate query results, XQueC applies 'lazy' decompression by issuing the queries directly in the compressed domain.
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol08261/DagSemProc.08261.12/DagSemProc.08261.12.pdf
XML compression
Data structures
XQuery querying