License
When quoting this document, please refer to the following:
URN: urn:nbn:de:0030-drops-2301
URL: http://drops.dagstuhl.de/opus/volltexte/2005/230/

Masciari, Elio ; Flesca, Sergio ; Manco, Giuseppe ; Pontieri, Luigi ; Pugliese, Andrea

Exploiting Structural Similarity For Effective Web Information Extraction



Abstract

In this paper we propose an architecture that exploits the structural information of web pages to extract relevant information from them. In this architecture, a distance-based classification methodology plays a primary role. The methodology is based on an efficient and effective technique for detecting structural similarities among semistructured documents, which differs significantly from standard methods based on graph-matching algorithms. The technique represents the structure of a document as a time series in which each occurrence of a tag corresponds to a given impulse. By analyzing the frequencies of the corresponding Fourier transform, we can then state the degree of similarity between documents. Experiments on real data show the effectiveness of the proposed technique.
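The idea of encoding tags as impulses and comparing documents in the frequency domain can be illustrated with a minimal sketch. It is not the authors' implementation: the tag encoding (consecutive integers in order of first appearance), the zero-padding to a common length, the Euclidean distance over DFT magnitudes, and the names `TagSequence`, `tag_signal`, `dft_magnitudes`, and `structural_distance` are all assumptions made for illustration.

```python
import cmath
import math
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Collects opening and closing tags in document order, ignoring text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def tag_signal(html, codes):
    """Turn a page into a numeric series: one impulse per tag occurrence.

    Assumed encoding: each distinct tag gets the next unused integer.
    `codes` is shared between documents so equal tags get equal values.
    """
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return [codes.setdefault(t, len(codes) + 1) for t in parser.tags]


def dft_magnitudes(signal, n):
    """Magnitudes of a naive DFT of `signal`, zero-padded to length n."""
    padded = signal + [0] * (n - len(signal))
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(padded)))
        for k in range(n // 2 + 1)
    ]


def structural_distance(html_a, html_b):
    """Euclidean distance between the two magnitude spectra (assumed metric)."""
    codes = {}
    a = tag_signal(html_a, codes)
    b = tag_signal(html_b, codes)
    n = max(len(a), len(b), 1)
    spec_a = dft_magnitudes(a, n)
    spec_b = dft_magnitudes(b, n)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(spec_a, spec_b)))


page_a = "<html><body><p>first</p><p>second</p></body></html>"
page_b = "<html><body><p>other</p><p>text</p></body></html>"
page_c = "<html><body><table><tr><td>x</td></tr></table></body></html>"

same_structure = structural_distance(page_a, page_b)
different_structure = structural_distance(page_a, page_c)
```

Because only tags contribute to the signal, `page_a` and `page_b` differ in text but not in structure, so their distance is zero, while `page_c` yields a positive distance. A distance-based classifier in the spirit of the abstract could then assign a new page to the class of its nearest labeled page under such a distance.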

BibTeX - Entry

@InProceedings{masciari_et_al:DSP:2005:230,
  author =	{Elio Masciari and Sergio Flesca and Giuseppe Manco and Luigi Pontieri and Andrea Pugliese},
  title =	{Exploiting Structural Similarity For Effective Web Information Extraction},
  booktitle =	{Foundations of Semistructured Data},
  year =	{2005},
  editor =	{Frank Neven and Thomas Schwentick and Dan Suciu},
  number =	{05061},
  series =	{Dagstuhl Seminar Proceedings},
  ISSN =	{1862-4405},
  publisher =	{Internationales Begegnungs- und Forschungszentrum f{\"u}r Informatik (IBFI), Schloss Dagstuhl, Germany},
  address =	{Dagstuhl, Germany},
  URL =		{http://drops.dagstuhl.de/opus/volltexte/2005/230},
  annote =	{Keywords: DFT, Web Document Structural Similarity}
}

Keywords: DFT, Web Document Structural Similarity
Seminar: 05061 - Foundations of Semistructured Data
Issue date: 2005
Date of publication: 10 August 2005


Published by LZI