eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2005-08-10
5061
1
13
10.4230/DagSemProc.05061.1
article
05061 Abstracts Collection – Foundations of Semistructured Data
Neven, Frank
Schwentick, Thomas
Suciu, Dan
From 06.02.05 to 11.02.05, the Dagstuhl Seminar
05061 "Foundations of Semistructured Data" was held
in the International Conference and Research Center (IBFI),
Schloss Dagstuhl.
During the seminar, several participants presented their current
research, and ongoing work and open problems were discussed. Abstracts of
the presentations given during the seminar as well as abstracts of
seminar results and ideas are put together in this paper. The first section
describes the seminar topics and goals in general.
Links to extended abstracts or full papers are provided, if available.
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol05061/DagSemProc.05061.1/DagSemProc.05061.1.pdf
Semistructured data
XML
database theory
document processing
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2005-08-10
5061
1
5
10.4230/DagSemProc.05061.2
article
05061 Summary – Foundations of Semi-structured Data
Neven, Frank
Schwentick, Thomas
Suciu, Dan
As in the first seminar on this topic, the aim of the workshop was to bring together people from the areas related to semi-structured data. However, besides the presentation of recent work, this time the main goal was to identify the main lines of a common framework for future foundational work on semi-structured data. These lines of research are summarized below.
The workshop was of a very interdisciplinary nature with invitees from databases, structured documents, programming languages, information retrieval and formal language theory. Several of the lectures were presented by PhD students. We had four invited speakers and a panel on research evaluation. Due to strong connections between topics treated at this workshop, many of the participants initiated new cooperations and research projects.
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol05061/DagSemProc.05061.2/DagSemProc.05061.2.pdf
Report
summary
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2005-08-10
5061
1
12
10.4230/DagSemProc.05061.3
article
Deterministic Automata on Unranked Trees
Thomas, Wolfgang
Christau, Julien
Löding, Christof
We investigate bottom-up and top-down deterministic automata
on unranked trees. We show that for an appropriate definition of
bottom-up deterministic automata it is possible to minimize the number
of states efficiently and to obtain a unique canonical representative of
the accepted tree language. For top-down deterministic automata it is
well known that they are less expressive than the non-deterministic ones.
By generalizing a corresponding proof from the theory of ranked tree automata
we show that it is decidable whether a given regular language
of unranked trees can be recognized by a top-down deterministic automaton.
The standard deterministic top-down model is slightly weaker
than the model we use, where at each node the automaton can scan the
sequence of the labels of its successors before deciding its next move.
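As a rough illustration of the bottom-up deterministic idea (not the definition used by the authors), the sketch below assigns each node a state by running a small deterministic "horizontal" automaton over the states of its children, so that nodes may have any number of children. The example language (Boolean formulas with `and`/`or` of arbitrary arity), the names `Node`, `HORIZONTAL`, `LEAF_STATE` and `run` are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


# Hypothetical automaton. For each inner label we give a horizontal DFA
# as (start state, transition function over child states). The state of
# the node is the horizontal state reached after reading all children.
HORIZONTAL = {
    "and": (True, lambda h, q: h and q),
    "or": (False, lambda h, q: h or q),
}

# Leaves get their state directly from their label.
LEAF_STATE = {"0": False, "1": True}


def run(node):
    """Return the unique state the automaton assigns to `node`."""
    if node.label in LEAF_STATE:
        if node.children:
            raise ValueError("leaf label used on an inner node")
        return LEAF_STATE[node.label]
    h, step = HORIZONTAL[node.label]
    # Children are processed left to right; each child contributes
    # exactly one state, so the run is deterministic.
    for child in node.children:
        h = step(h, run(child))
    return h


tree = Node("or", [Node("0"), Node("and", [Node("1"), Node("1"), Node("1")])])
accepted = run(tree)  # the tree is accepted if the root state is True
```

Because every node receives exactly one state, two states can be compared by the sets of contexts they accept, which is the kind of structure a minimization procedure needs.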
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol05061/DagSemProc.05061.3/DagSemProc.05061.3.pdf
Automata
unranked trees
Parikh automata
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2005-08-10
5061
1
20
10.4230/DagSemProc.05061.4
article
Exploiting Structural Similarity For Effective Web Information Extraction
Masciari, Elio
Flesca, Sergio
Manco, Giuseppe
Pontieri, Luigi
Pugliese, Andrea
In this paper we propose an architecture that exploits the structural information of web pages to extract relevant information from them.
In this architecture, a primary role is played by a distance-based classification methodology.
Such a methodology is based on an efficient and effective technique for detecting structural similarities among semistructured documents,
which significantly differs from standard methods based on graph-matching algorithms.
The technique is based on the idea of representing the structure of a document as a time series in which each occurrence
of a tag corresponds to a given impulse. By analyzing the frequencies of the corresponding Fourier transform, we can hence state
the degree of similarity between documents.
Experiments on real data show the effectiveness of the proposed technique.
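The time-series idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' encoding: the tag codes (consecutive integers, negated for closing tags), the zero-padding to a common length, and the Euclidean distance between magnitude spectra are all choices made for this sketch, as are the function names.

```python
import cmath
import math
import re

# Matches the start of an opening or closing tag and captures its name.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)")


def tag_sequence(html):
    """Return the document's tags in order as (name, is_closing) pairs."""
    return [(name.lower(), slash == "/") for slash, name in TAG.findall(html)]


def encode(tags, codes):
    """Turn each tag occurrence into an impulse (assumed encoding)."""
    return [-codes[name] if closing else codes[name] for name, closing in tags]


def dft_magnitudes(signal, n):
    """Magnitudes of the n-point discrete Fourier transform (zero-padded)."""
    padded = signal + [0.0] * (n - len(signal))
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(padded)))
        for k in range(n)
    ]


def structural_distance(html_a, html_b):
    """Distance between the frequency spectra of two tag sequences."""
    tags_a, tags_b = tag_sequence(html_a), tag_sequence(html_b)
    names = sorted({name for name, _ in tags_a + tags_b})
    codes = {name: i + 1 for i, name in enumerate(names)}
    n = max(len(tags_a), len(tags_b), 1)
    mag_a = dft_magnitudes(encode(tags_a, codes), n)
    mag_b = dft_magnitudes(encode(tags_b, codes), n)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mag_a, mag_b)))


list_a = "<ul><li>x</li><li>y</li></ul>"
list_b = "<ul><li>p</li><li>q</li></ul>"
table = "<table><tr><td>x</td></tr></table>"
same = structural_distance(list_a, list_b)   # identical structure
other = structural_distance(list_a, table)   # different structure
```

Text content is ignored entirely, so two pages with the same markup skeleton are at distance zero regardless of what they say.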
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol05061/DagSemProc.05061.4/DagSemProc.05061.4.pdf
DFT
Web Document Structural Similarity
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2005-08-10
5061
1
15
10.4230/DagSemProc.05061.5
article
N-ary Queries by Tree Automata
Niehren, Joachim
Planque, Laurent
Talbot, Jean-Marc
Tison, Sophie
Information extraction from semi-structured documents requires finding
n-ary queries in trees that define appropriate sets of n-tuples of
nodes. We propose new representation formalisms for
n-ary queries by tree automata that we prove to capture MSO. We then
investigate n-ary queries by unambiguous tree automata which are
relevant for query induction in multi-slot information extraction.
We show that this representation formalism captures the
class of n-ary queries that are finite unions of Cartesian closed
queries, a property we prove decidable.
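To make the last notion concrete, the sketch below checks the property on a single answer set, assuming that "Cartesian closed" means the set of answer tuples equals the Cartesian product of its component-wise projections. That reading, and the function name `is_cartesian_closed`, are assumptions of this sketch; deciding the property for a query over all trees is a different and harder problem.

```python
from itertools import product


def is_cartesian_closed(tuples):
    """True if the tuple set equals the product of its projections."""
    tuples = set(tuples)
    if not tuples:
        return True
    # Project the answer set onto each of its n components.
    projections = [set(column) for column in zip(*tuples)]
    return set(product(*projections)) == tuples


# Node pairs (identified by integers here) selected on some tree.
closed = {(1, 3), (1, 4), (2, 3), (2, 4)}
not_closed = {(1, 3), (2, 4)}
```

The second set is not Cartesian closed on its own, but it is the union of the two closed sets `{(1, 3)}` and `{(2, 4)}`, which is the shape of query the abstract refers to.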
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol05061/DagSemProc.05061.5/DagSemProc.05061.5.pdf
Information extraction
semistructured documents
node selecting queries in trees
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Dagstuhl Seminar Proceedings
1862-4405
2005-08-10
5061
1
23
10.4230/DagSemProc.05061.6
article
Node Identification Schemes for Efficient XML Retrieval
Weigel, Felix
Schulz, Klaus U.
Meuss, Holger
Node identifiers (IDs) encoding part of the tree structure in XML documents can save I/O for table look-ups,
thus speeding up the evaluation of path and tree queries on large persistent document collections. In
particular, binary tree relations such as the extended XPath axes can be either decided for a given pair of
node IDs, or reconstructed for a single node ID, without access to secondary storage. Several ID schemes have
been proposed so far, which differ with respect to (1) expressiveness, i.e. which relations can be
decided or reconstructed from IDs, (2) the runtime performance and asymptotic behaviour of decision and
reconstruction operations, (3) the storage overhead for the IDs, and (4) robustness, i.e. behaviour in the
presence of updates. First we review five ID schemes, positioning them in the trade-off between these four comparison
criteria. Then a new ID scheme called BIRD, for Balanced Index-based ID scheme for Reconstruction and
Decision, is introduced and illustrated through several examples of decision and reconstruction operations
on IDs. We argue that, by emphasizing runtime performance and expressive power, BIRD's strategy in the above
trade-off is best for many applications, especially where storage minimization is not the primary goal and
updates occur in bulk rather than in real time. Our experimental results on document collections of
up to one gigabyte show BIRD to be the most efficient in terms of expressiveness and runtime performance. Most
notably, BIRD is the only scheme to support both decision and reconstruction of many relations in constant
time. BIRD is also highly competitive in terms of storage and robustness.
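The difference between deciding and reconstructing a relation can be illustrated with the classic pre-order/post-order numbering. This is a generic textbook-style scheme chosen for the sketch, not BIRD; the tuple-based tree representation and the function names are assumptions.

```python
from itertools import count


def assign_ids(tree):
    """Map each node name to a (pre-order, post-order) ID pair.

    `tree` is a nested (name, [children]) tuple with unique names.
    """
    ids = {}
    pre_counter, post_counter = count(), count()

    def visit(node):
        name, children = node
        pre = next(pre_counter)
        for child in children:
            visit(child)
        ids[name] = (pre, next(post_counter))

    visit(tree)
    return ids


def is_ancestor(a, b):
    """Decide the ancestor relation from two IDs alone."""
    return a[0] < b[0] and a[1] > b[1]


def is_following(a, b):
    """Decide whether b comes after a in document order, outside a's subtree."""
    return a[0] < b[0] and a[1] < b[1]


doc = ("book", [("title", []), ("chapter", [("section", [])])])
ids = assign_ids(doc)
```

With these IDs, ancestor and following tests need no access to the document, but the scheme only supports decision: the ID of a node's parent cannot be computed from the node's own ID, which is the kind of reconstruction operation the abstract distinguishes.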
https://drops.dagstuhl.de/storage/16dagstuhl-seminar-proceedings/dsp-vol05061/DagSemProc.05061.6/DagSemProc.05061.6.pdf
node identification scheme
labelling scheme
numbering scheme
naming scheme
tree encoding
BIRD