Clustering Web pages based on their structure

Valter Crescenzi; Paolo Merialdo; Paolo Missier

doi:10.1016/j.datak.2004.11.004

Corresponding author: Paolo Merialdo

Computer Science

Research output: Contribution to journal › Conference article › peer-review

Abstract

Several techniques have been recently proposed to automatically generate Web wrappers, i.e., programs that extract data from HTML pages, and transform them into a more structured format, typically in XML. These techniques automatically induce a wrapper from a set of sample pages that share a common HTML template. An open issue, however, is how to collect suitable classes of sample pages to feed the wrapper inducer. Presently, the pages are chosen manually. In this paper, we tackle the problem of automatically discovering the main classes of pages offered by a site by exploring only a small yet representative portion of it. We propose a model to describe abstract structural features of HTML pages. Based on this model, we have developed an algorithm that accepts the URL of an entry point to a target Web site, visits a limited yet representative number of pages, and produces an accurate clustering of pages based on their structure. We have developed a prototype, which has been used to perform experiments on real-life Web sites.
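The idea of grouping pages by structure rather than by content can be illustrated with a minimal sketch. This is not the authors' model or algorithm, which the abstract does not detail. As an assumption for illustration only, each page is summarised by the set of tag paths that lead to its links, and pages are grouped greedily when the Jaccard similarity of these sets reaches a threshold. The names `structural_signature`, `cluster_pages`, and the `threshold` parameter are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they are kept off the open-tag stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LinkPathCollector(HTMLParser):
    """Collects the root-to-<a> tag paths found in one HTML page."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags
        self.paths = set()  # e.g. {"html/body/ul/li/a"}

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag == "a":
            self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def structural_signature(html):
    """Hypothetical page summary: the set of tag paths that lead to links."""
    collector = LinkPathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.5):
    """Greedy single-pass grouping of {url: html} by signature similarity."""
    clusters = []  # list of (representative signature, [urls])
    for url, html in pages.items():
        sig = structural_signature(html)
        for rep, members in clusters:
            if jaccard(sig, rep) >= threshold:
                members.append(url)
                break
        else:
            clusters.append((sig, [url]))
    return [members for _, members in clusters]
```

Given one list-style index page and two detail pages that share a template, this sketch would be expected to put the two detail pages in one group and the index page in another. A real system would also need the crawling strategy the abstract mentions, which is left out here.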

Original language	English
Pages (from-to)	279-299
Number of pages	21
Journal	Data and Knowledge Engineering
Volume	54
Issue number	3
DOIs	https://doi.org/10.1016/j.datak.2004.11.004
Publication status	Published - 2005
Event	Fifth ACM International Workshop on Web Information and Data Management (WIDM 2003) - Duration: 7 Nov 2003 → 8 Nov 2003

Keywords

Clustering
Information extraction
Web mining
Web modelling
Wrapper induction

ASJC Scopus subject areas

Information Systems and Management

Cite this

@article{d573d91e15c54fd9b0f0f6430d833ad8,
  title = "Clustering Web pages based on their structure",
  abstract = "Several techniques have been recently proposed to automatically generate Web wrappers, i.e., programs that extract data from HTML pages, and transform them into a more structured format, typically in XML. These techniques automatically induce a wrapper from a set of sample pages that share a common HTML template. An open issue, however, is how to collect suitable classes of sample pages to feed the wrapper inducer. Presently, the pages are chosen manually. In this paper, we tackle the problem of automatically discovering the main classes of pages offered by a site by exploring only a small yet representative portion of it. We propose a model to describe abstract structural features of HTML pages. Based on this model, we have developed an algorithm that accepts the URL of an entry point to a target Web site, visits a limited yet representative number of pages, and produces an accurate clustering of pages based on their structure. We have developed a prototype, which has been used to perform experiments on real-life Web sites.",
  keywords = "Clustering, Information extraction, Web mining, Web modelling, Wrapper induction",
  author = "Valter Crescenzi and Paolo Merialdo and Paolo Missier",
  year = "2005",
  doi = "10.1016/j.datak.2004.11.004",
  language = "English",
  volume = "54",
  pages = "279--299",
  journal = "Data and Knowledge Engineering",
  issn = "0169-023X",
  publisher = "Elsevier",
  number = "3",
  note = "Fifth ACM International Workshop on Web Information and Data Management (WIDM 2003) ; Conference date: 07-11-2003 Through 08-11-2003",
}

TY  - JOUR
T1  - Clustering Web pages based on their structure
AU  - Crescenzi, Valter
AU  - Merialdo, Paolo
AU  - Missier, Paolo
PY  - 2005
Y1  - 2005
N2  - Several techniques have been recently proposed to automatically generate Web wrappers, i.e., programs that extract data from HTML pages, and transform them into a more structured format, typically in XML. These techniques automatically induce a wrapper from a set of sample pages that share a common HTML template. An open issue, however, is how to collect suitable classes of sample pages to feed the wrapper inducer. Presently, the pages are chosen manually. In this paper, we tackle the problem of automatically discovering the main classes of pages offered by a site by exploring only a small yet representative portion of it. We propose a model to describe abstract structural features of HTML pages. Based on this model, we have developed an algorithm that accepts the URL of an entry point to a target Web site, visits a limited yet representative number of pages, and produces an accurate clustering of pages based on their structure. We have developed a prototype, which has been used to perform experiments on real-life Web sites.
AB  - Several techniques have been recently proposed to automatically generate Web wrappers, i.e., programs that extract data from HTML pages, and transform them into a more structured format, typically in XML. These techniques automatically induce a wrapper from a set of sample pages that share a common HTML template. An open issue, however, is how to collect suitable classes of sample pages to feed the wrapper inducer. Presently, the pages are chosen manually. In this paper, we tackle the problem of automatically discovering the main classes of pages offered by a site by exploring only a small yet representative portion of it. We propose a model to describe abstract structural features of HTML pages. Based on this model, we have developed an algorithm that accepts the URL of an entry point to a target Web site, visits a limited yet representative number of pages, and produces an accurate clustering of pages based on their structure. We have developed a prototype, which has been used to perform experiments on real-life Web sites.
KW  - Clustering
KW  - Information extraction
KW  - Web mining
KW  - Web modelling
KW  - Wrapper induction
UR  - http://www.scopus.com/inward/record.url?scp=18844436436&partnerID=8YFLogxK
U2  - 10.1016/j.datak.2004.11.004
DO  - 10.1016/j.datak.2004.11.004
M3  - Conference article
AN  - SCOPUS:18844436436
SN  - 0169-023X
VL  - 54
SP  - 279
EP  - 299
JO  - Data and Knowledge Engineering
JF  - Data and Knowledge Engineering
IS  - 3
T2  - Fifth ACM International Workshop on Web Information and Data Management (WIDM 2003)
Y2  - 7 November 2003 through 8 November 2003
ER  - 
