Dagstuhl Seminar Proceedings, Volume 5061

Document

05061 Abstracts Collection – Foundations of Semistructured Data

Authors: Frank Neven, Thomas Schwentick, and Dan Suciu

Abstract

From 06.02.05 to 11.02.05, the Dagstuhl Seminar 05061 "Foundations of Semistructured Data" was held in the International Conference and Research Center (IBFI), Schloss Dagstuhl. During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts of the presentations given during the seminar as well as abstracts of seminar results and ideas are put together in this paper. The first section describes the seminar topics and goals in general. Links to extended abstracts or full papers are provided, if available.

Cite as

Frank Neven, Thomas Schwentick, and Dan Suciu. 05061 Abstracts Collection – Foundations of Semistructured Data. In Foundations of Semistructured Data. Dagstuhl Seminar Proceedings, Volume 5061, pp. 1-13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2005)

@InProceedings{neven_et_al:DagSemProc.05061.1,
  author =	{Neven, Frank and Schwentick, Thomas and Suciu, Dan},
  title =	{{05061 Abstracts Collection -- Foundations of Semistructured Data}},
  booktitle =	{Foundations of Semistructured Data},
  pages =	{1--13},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2005},
  volume =	{5061},
  editor =	{Frank Neven and Thomas Schwentick and Dan Suciu},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/DagSemProc.05061.1},
  URN =		{urn:nbn:de:0030-drops-2330},
  doi =		{10.4230/DagSemProc.05061.1},
  annote =	{Keywords: Semistructured data, XML, database theory, document processing}
}

Document

DOI: 10.4230/DagSemProc.05061.2

05061 Summary – Foundations of Semi-structured Data

Authors: Frank Neven, Thomas Schwentick, and Dan Suciu

Abstract

As in the first seminar on this topic, the aim of the workshop was to bring together people from the areas related to semi-structured data. However, besides the presentation of recent work, this time the main goal was to identify the main lines of a common framework for future foundational work on semi-structured data. These lines of research are summarized below. The workshop was of a very interdisciplinary nature with invitees from databases, structured documents, programming languages, information retrieval and formal language theory. Several of the lectures were presented by PhD students. We had four invited speakers and a panel on research evaluation. Due to strong connections between topics treated at this workshop, many of the participants initiated new cooperations and research projects.

Cite as

Frank Neven, Thomas Schwentick, and Dan Suciu. 05061 Summary – Foundations of Semi-structured Data. In Foundations of Semistructured Data. Dagstuhl Seminar Proceedings, Volume 5061, pp. 1-5, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2005)

@InProceedings{neven_et_al:DagSemProc.05061.2,
  author =	{Neven, Frank and Schwentick, Thomas and Suciu, Dan},
  title =	{{05061 Summary -- Foundations of Semi-structured Data}},
  booktitle =	{Foundations of Semistructured Data},
  pages =	{1--5},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2005},
  volume =	{5061},
  editor =	{Frank Neven and Thomas Schwentick and Dan Suciu},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/DagSemProc.05061.2},
  URN =		{urn:nbn:de:0030-drops-2276},
  doi =		{10.4230/DagSemProc.05061.2},
  annote =	{Keywords: Report, summary}
}

Document

DOI: 10.4230/DagSemProc.05061.3

Deterministic Automata on Unranked Trees

Authors: Wolfgang Thomas, Julien Christau, and Christof Löding

Abstract

We investigate bottom-up and top-down deterministic automata on unranked trees. We show that for an appropriate definition of bottom-up deterministic automata it is possible to minimize the number of states efficiently and to obtain a unique canonical representative of the accepted tree language. For top-down deterministic automata it is well known that they are less expressive than the non-deterministic ones. By generalizing a corresponding proof from the theory of ranked tree automata we show that it is decidable whether a given regular language of unranked trees can be recognized by a top-down deterministic automaton. The standard deterministic top-down model is slightly weaker than the model we use, where at each node the automaton can scan the sequence of the labels of its successors before deciding its next move.
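As a rough illustration of what a bottom-up deterministic run on an unranked tree can look like, the sketch below evaluates trees whose inner nodes may have any number of children. It is not the definition used in the paper: the tuple encoding of trees, the labels `and`, `or`, `0`, `1`, and the names `HORIZONTAL`, `LEAF`, `evaluate` and `accepts` are all assumptions made for this example. Each inner label comes with a small deterministic "horizontal" automaton that reads the states of the children from left to right.

```python
# Hypothetical example: trees are (label, [children]) tuples; all names are illustrative.

# For each inner label: the initial horizontal state and a deterministic step
# function that consumes one child state at a time, left to right.
HORIZONTAL = {
    "and": (True, lambda h, q: h and q),
    "or": (False, lambda h, q: h or q),
}

# Leaf labels map directly to a state.
LEAF = {"0": False, "1": True}


def evaluate(tree):
    """Return the unique state the automaton assigns to the root of `tree`."""
    label, children = tree
    if label in LEAF:
        if children:
            raise ValueError(f"leaf label {label!r} must not have children")
        return LEAF[label]
    state, step = HORIZONTAL[label]
    for child in children:
        # The child's state is computed first (bottom-up), then fed to the
        # horizontal automaton of the current node.
        state = step(state, evaluate(child))
    return state


def accepts(tree):
    """A tree is accepted when the root state is the single final state True."""
    return evaluate(tree) is True
```

Because every node receives exactly one state, the run is deterministic regardless of how many children a node has, for instance `accepts(("and", [("1", []), ("or", [("0", []), ("1", []), ("0", [])])]))` evaluates an `or` node with three children.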

Cite as

Wolfgang Thomas, Julien Christau, and Christof Löding. Deterministic Automata on Unranked Trees. In Foundations of Semistructured Data. Dagstuhl Seminar Proceedings, Volume 5061, pp. 1-12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2005)

@InProceedings{thomas_et_al:DagSemProc.05061.3,
  author =	{Thomas, Wolfgang and Christau, Julien and L\"{o}ding, Christof},
  title =	{{Deterministic Automata on Unranked Trees}},
  booktitle =	{Foundations of Semistructured Data},
  pages =	{1--12},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2005},
  volume =	{5061},
  editor =	{Frank Neven and Thomas Schwentick and Dan Suciu},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/DagSemProc.05061.3},
  URN =		{urn:nbn:de:0030-drops-2281},
  doi =		{10.4230/DagSemProc.05061.3},
  annote =	{Keywords: Automata, unranked trees, Parikh automata}
}

Document

DOI: 10.4230/DagSemProc.05061.4

Exploiting Structural Similarity For Effective Web Information Extraction

Authors: Elio Masciari, Sergio Flesca, Giuseppe Manco, Luigi Pontieri, and Andrea Pugliese

Abstract

In this paper we propose an architecture that exploits the structural information of web pages to extract relevant information from them. In this architecture, a distance-based classification methodology plays a primary role. The methodology is based on an efficient and effective technique for detecting structural similarities among semistructured documents, which differs significantly from standard methods based on graph-matching algorithms. The technique represents the structure of a document as a time series in which each occurrence of a tag corresponds to a given impulse. By analyzing the frequencies of the corresponding Fourier transform, we can then state the degree of similarity between documents. Experiments on real data show the effectiveness of the proposed technique.
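The idea of comparing documents through the Fourier transform of a tag series can be sketched as follows. This is only a minimal illustration under stated assumptions, not the encoding or distance used in the paper: the integer tag codes, the sign convention for closing tags, the zero-padding to a common length, and the function names `tag_series`, `dft_magnitudes` and `structural_distance` are all hypothetical choices.

```python
import cmath
import re


def tag_series(markup, codes):
    """Turn markup into a numeric series with one impulse per tag occurrence.

    Assumption: each distinct tag name gets a positive integer code; an
    opening tag contributes +code and a closing tag contributes -code.
    """
    series = []
    for closing, name in re.findall(r"<(/?)([A-Za-z][A-Za-z0-9]*)", markup):
        code = codes.setdefault(name.lower(), len(codes) + 1)
        series.append(-code if closing else code)
    return series


def dft_magnitudes(series, n):
    """Magnitudes of the n-point discrete Fourier transform (zero-padded)."""
    padded = series + [0] * (n - len(series))
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(padded)))
        for k in range(n)
    ]


def structural_distance(doc_a, doc_b):
    """Euclidean distance between the two magnitude spectra, scaled by length."""
    codes = {}  # shared so that both documents use the same tag codes
    a = tag_series(doc_a, codes)
    b = tag_series(doc_b, codes)
    n = max(len(a), len(b), 1)
    fa = dft_magnitudes(a, n)
    fb = dft_magnitudes(b, n)
    return sum((x - y) ** 2 for x, y in zip(fa, fb)) ** 0.5 / n
```

Two pages with the same tag skeleton but different text get distance zero under this sketch, while pages with different skeletons get a positive distance. The direct O(n^2) transform keeps the example dependency-free; a real system would presumably use an FFT.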

Cite as

Elio Masciari, Sergio Flesca, Giuseppe Manco, Luigi Pontieri, and Andrea Pugliese. Exploiting Structural Similarity For Effective Web Information Extraction. In Foundations of Semistructured Data. Dagstuhl Seminar Proceedings, Volume 5061, pp. 1-20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2005)

@InProceedings{masciari_et_al:DagSemProc.05061.4,
  author =	{Masciari, Elio and Flesca, Sergio and Manco, Giuseppe and Pontieri, Luigi and Pugliese, Andrea},
  title =	{{Exploiting Structural Similarity For Effective Web Information Extraction}},
  booktitle =	{Foundations of Semistructured Data},
  pages =	{1--20},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2005},
  volume =	{5061},
  editor =	{Frank Neven and Thomas Schwentick and Dan Suciu},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/DagSemProc.05061.4},
  URN =		{urn:nbn:de:0030-drops-2301},
  doi =		{10.4230/DagSemProc.05061.4},
  annote =	{Keywords: DFT, Web Document Structural Similarity}
}

Document

DOI: 10.4230/DagSemProc.05061.5

N-ary Queries by Tree Automata

Authors: Joachim Niehren, Laurent Planque, Jean-Marc Talbot, and Sophie Tison

Abstract

Information extraction from semi-structured documents requires finding n-ary queries in trees that define appropriate sets of n-tuples of nodes. We propose new representation formalisms for n-ary queries by tree automata that we prove to capture MSO. We then investigate n-ary queries by unambiguous tree automata, which are relevant for query induction in multi-slot information extraction. We show that this representation formalism captures the class of n-ary queries that are finite unions of Cartesian closed queries, a property we prove decidable.

Cite as

Joachim Niehren, Laurent Planque, Jean-Marc Talbot, and Sophie Tison. N-ary Queries by Tree Automata. In Foundations of Semistructured Data. Dagstuhl Seminar Proceedings, Volume 5061, pp. 1-15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2005)

@InProceedings{niehren_et_al:DagSemProc.05061.5,
  author =	{Niehren, Joachim and Planque, Laurent and Talbot, Jean-Marc and Tison, Sophie},
  title =	{{N-ary Queries by Tree Automata}},
  booktitle =	{Foundations of Semistructured Data},
  pages =	{1--15},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2005},
  volume =	{5061},
  editor =	{Frank Neven and Thomas Schwentick and Dan Suciu},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/DagSemProc.05061.5},
  URN =		{urn:nbn:de:0030-drops-2263},
  doi =		{10.4230/DagSemProc.05061.5},
  annote =	{Keywords: Information extraction, semistructured documents, node selecting queries in trees}
}

Document

DOI: 10.4230/DagSemProc.05061.6

Node Identification Schemes for Efficient XML Retrieval

Authors: Felix Weigel, Klaus U. Schulz, and Holger Meuss

Abstract

Node identifiers (IDs) encoding part of the tree structure in XML documents can save I/O for table look-ups, thus speeding up the evaluation of path and tree queries on large persistent document collections. In particular, binary tree relations such as the extended XPath axes can be either decided for a given pair of node IDs, or reconstructed for a single node ID, without access to secondary storage. Several ID schemes have been proposed so far, which differ with respect to (1) expressiveness, i.e. which relations can be decided or reconstructed from IDs, (2) the runtime performance and asymptotic behaviour of decision and reconstruction operations, (3) the storage overhead for the IDs, and (4) robustness, i.e. behaviour in the presence of updates. First we review five ID schemes, positioning them in the trade-off between these four comparison criteria. Then a new ID scheme called BIRD, for Balanced Index-based ID scheme for Reconstruction and Decision, is introduced and illustrated through several examples of decision and reconstruction operations on IDs. We argue that, by emphasizing runtime performance and expressive power, BIRD's strategy in the above trade-off is best for many applications, especially where storage minimization is not the primary goal and updates occur in bulk rather than in real time. Our experimental results on document collections of up to one gigabyte show BIRD to be most efficient in terms of expressiveness and runtime performance. Most notably, BIRD is the only scheme to support both decision and reconstruction of many relations in constant time. BIRD is also highly competitive in terms of storage and robustness.
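To make the notion of deciding a tree relation from IDs alone concrete, here is a minimal sketch of the classic pre/post-order numbering. It is not BIRD and not necessarily one of the five schemes reviewed in the paper; the function names `label_nodes` and `is_ancestor` are assumptions for this example. Note that this simple scheme supports decision only: a parent's ID cannot be reconstructed from a child's ID without further information.

```python
import xml.etree.ElementTree as ET


def label_nodes(root):
    """Assign each element a (preorder, postorder) pair as its node ID."""
    labels = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        pre = counters["pre"]
        counters["pre"] += 1
        for child in node:
            visit(child)
        # The postorder number is only known once all descendants are done.
        labels[node] = (pre, counters["post"])
        counters["post"] += 1

    visit(root)
    return labels


def is_ancestor(id_a, id_b):
    """Decide the ancestor relation from two IDs, without touching the tree.

    A node is a proper ancestor of another exactly when it comes earlier in
    preorder and later in postorder.
    """
    return id_a[0] < id_b[0] and id_a[1] > id_b[1]
```

For the document `<a><b><c/></b><d/></a>`, `label_nodes` yields the IDs (0, 3) for `a`, (1, 1) for `b`, (2, 0) for `c` and (3, 2) for `d`, so `is_ancestor` can answer queries such as "is `b` an ancestor of `c`" from the two pairs alone. The recursive traversal is fine for a sketch; very deep documents would need an iterative version.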

Cite as

Felix Weigel, Klaus U. Schulz, and Holger Meuss. Node Identification Schemes for Efficient XML Retrieval. In Foundations of Semistructured Data. Dagstuhl Seminar Proceedings, Volume 5061, pp. 1-23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2005)

@InProceedings{weigel_et_al:DagSemProc.05061.6,
  author =	{Weigel, Felix and Schulz, Klaus U. and Meuss, Holger},
  title =	{{Node Identification Schemes for Efficient XML Retrieval}},
  booktitle =	{Foundations of Semistructured Data},
  pages =	{1--23},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2005},
  volume =	{5061},
  editor =	{Frank Neven and Thomas Schwentick and Dan Suciu},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/DagSemProc.05061.6},
  URN =		{urn:nbn:de:0030-drops-2292},
  doi =		{10.4230/DagSemProc.05061.6},
  annote =	{Keywords: node identification scheme, labelling scheme, numbering scheme, naming scheme, tree encoding, BIRD}
}
