Fine-grain web site structure discovery

Valter Crescenzi*, Paolo Merialdo, Paolo Missier

*Corresponding author for this work

Research output: Contribution to conference (unpublished) › Paper › peer-review

Abstract

Several techniques have recently been proposed to automatically derive web wrappers, i.e., programs that extract data from HTML pages and transform them into a more structured format, typically in XML syntax. These techniques automatically induce a wrapper from a set of sample pages that share a common HTML template. An open issue, however, is how to collect suitable classes of sample pages to feed the wrapper inducer. Presently, the pages are chosen manually. In this paper, we tackle the problem of automatically discovering the main classes of pages offered by a site by exploring only a small, representative portion of it. The web site model we propose describes the structure of the site as a graph whose nodes are classes of pages that share a common structure, and whose edges represent links among instances of the page classes. Using this model, we have developed an algorithm that accepts the URL of an entry point to the target web site, visits a limited portion of the site, and produces an accurate model of the site structure. We also report on preliminary experiments performed on actual web sites, which have produced encouraging results.
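The site model described in the abstract (page classes as nodes, links between their instances as edges) can be illustrated with a minimal sketch. This is not the paper's algorithm, which the abstract does not detail. It assumes, purely for illustration, that two pages belong to the same class when their links occur at the same set of DOM paths. The names `SITE`, `signature`, `discover`, and `max_pages` are hypothetical, and the site is a hand-written in-memory dictionary rather than fetched HTML.

```python
from collections import defaultdict, deque

# Hypothetical in-memory site: URL -> list of (DOM path of the anchor, target URL).
# A real crawler would fetch and parse HTML pages instead.
SITE = {
    "/": [("html/body/ul/li/a", "/teams/1"), ("html/body/ul/li/a", "/teams/2")],
    "/teams/1": [("html/body/table/tr/td/a", "/players/1"),
                 ("html/body/table/tr/td/a", "/players/2")],
    "/teams/2": [("html/body/table/tr/td/a", "/players/3")],
    "/players/1": [("html/body/div/a", "/teams/1")],
    "/players/2": [("html/body/div/a", "/teams/1")],
    "/players/3": [("html/body/div/a", "/teams/2")],
}


def signature(links):
    """Assumed structural signature: the set of DOM paths that hold links."""
    return frozenset(path for path, _ in links)


def discover(site, entry, max_pages=100):
    """Visit at most `max_pages` pages breadth-first from `entry`, group them
    into classes by signature, and collect class-to-class link edges."""
    class_of = {}    # url -> class id
    class_ids = {}   # signature -> class id
    queue, seen = deque([entry]), {entry}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        links = site.get(url, [])
        class_of[url] = class_ids.setdefault(signature(links), len(class_ids))
        visited.append(url)
        for _, target in links:
            if target in site and target not in seen:
                seen.add(target)
                queue.append(target)
    # Keep only edges whose two endpoints were both visited and classified.
    edges = set()
    for url in visited:
        for _, target in site.get(url, []):
            if target in class_of:
                edges.add((class_of[url], class_of[target]))
    members = defaultdict(list)
    for url, cid in class_of.items():
        members[cid].append(url)
    return dict(members), edges


if __name__ == "__main__":
    members, edges = discover(SITE, "/")
    print(members)
    print(sorted(edges))
```

The `max_pages` cap stands in for the abstract's "limited portion of the site": lowering it yields a coarser model built from fewer visited pages.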

Original language: English
Pages: 15-22
Number of pages: 8
Publication status: Published - 2003
Event: WIDM 2003: Proceedings of the Fifth ACM International Workshop on Web Information and Data Management - New Orleans, LA, United States
Duration: 7 Nov 2003 → 8 Nov 2003

Conference

Conference: WIDM 2003: Proceedings of the Fifth ACM International Workshop on Web Information and Data Management
Country/Territory: United States
City: New Orleans, LA
Period: 7/11/03 → 8/11/03

Keywords

  • Clustering
  • Information Extraction
  • Web Information Systems
  • Web Modeling
  • Wrapper Induction

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
