Exploiting Structural Similarity For Effective Web Information Extraction

Authors: Elio Masciari, Sergio Flesca, Giuseppe Manco, Luigi Pontieri, Andrea Pugliese




File

DagSemProc.05061.4.pdf
  • Filesize: 0.52 MB
  • 20 pages



Cite As

Elio Masciari, Sergio Flesca, Giuseppe Manco, Luigi Pontieri, and Andrea Pugliese. Exploiting Structural Similarity For Effective Web Information Extraction. In Foundations of Semistructured Data. Dagstuhl Seminar Proceedings, Volume 5061, pp. 1-20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2005)
https://doi.org/10.4230/DagSemProc.05061.4

Abstract

In this paper we propose an architecture that exploits the structural information of web pages to extract relevant information from them. A distance-based classification methodology plays a primary role in this architecture. The methodology relies on an efficient and effective technique for detecting structural similarities among semistructured documents, which differs significantly from standard methods based on graph-matching algorithms. The technique represents the structure of a document as a time series in which each occurrence of a tag corresponds to a given impulse. By analyzing the frequencies of the corresponding Fourier transform, we can then state the degree of similarity between documents. Experiments on real data show the effectiveness of the proposed technique.
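
The idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual tag encoding or distance function, so everything below is an assumption made for illustration: the names `tag_signal`, `dft_magnitudes` and `structural_distance` are hypothetical, tags are given arbitrary integer codes (positive for an opening tag, negative for a closing tag), and similarity is taken as the Euclidean distance between the magnitude spectra of the zero-padded signals.

```python
import cmath
import math
import re

# Matches opening and closing tags; group 1 is "/" for a closing tag.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def tag_signal(html, codes):
    """Turn the tag sequence of a document into a list of impulses.

    Assumption (not from the paper): each distinct tag name receives an
    arbitrary positive integer code; an opening tag emits +code and a
    closing tag emits -code. Text content is ignored.
    """
    signal = []
    for closing, name in TAG_RE.findall(html):
        code = codes.setdefault(name.lower(), len(codes) + 1)
        signal.append(-code if closing else code)
    return signal


def dft_magnitudes(signal, n):
    """Magnitudes of the n-point DFT of the signal, zero-padded to length n."""
    padded = signal + [0] * (n - len(signal))
    magnitudes = []
    for k in range(n):
        total = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(padded)
        )
        magnitudes.append(abs(total))
    return magnitudes


def structural_distance(html_a, html_b):
    """Illustrative distance: root mean squared gap between magnitude spectra."""
    codes = {}  # shared so both documents use the same tag codes
    a = tag_signal(html_a, codes)
    b = tag_signal(html_b, codes)
    n = max(len(a), len(b), 1)
    spec_a = dft_magnitudes(a, n)
    spec_b = dft_magnitudes(b, n)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(spec_a, spec_b)) / n)


list_page = "<html><body><ul><li>a</li><li>b</li></ul></body></html>"
same_layout = "<html><body><ul><li>x</li><li>y</li></ul></body></html>"
para_page = "<html><body><p>a</p><p>b</p></body></html>"

print(structural_distance(list_page, same_layout))  # identical tag structure
print(structural_distance(list_page, para_page))    # different tag structure
```

In a distance-based classifier of the kind the abstract mentions, such a value could be compared against the distances to labelled example pages, for instance by assigning a new page to the class of its nearest neighbour. That choice is likewise an assumption here and is not taken from the paper.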
Keywords
  • DFT
  • Web Document Structural Similarity
