An Automatic Data Grabber for Large Web Sites

Valter Crescenzi, Paolo Merialdo, Paolo Missier, Giansalvatore Mecca

Research output: Chapter in Book/Report/Conference proceeding › Chapter


This chapter investigates a system that automatically grabs data from data-intensive Web sites. The system first infers a model that describes the Web site as a collection of classes. Each class represents a set of structurally homogeneous pages and is associated with a small set of representative pages. Based on the model, a library of wrappers, one per class, is then inferred with the help of an external wrapper generator. The model, together with the library of wrappers, can thus be used to navigate the site and extract the data. The inference process is performed incrementally. The system starts from a given entry point, which becomes the first member of the first class in the model. It then refines the model by exploring its boundaries to gather new pages. At each iteration, the system selects an outbound link collection from the model and fetches pages one at a time by following the links in the collection. To reduce the number of pages actually visited, after each download the system guesses the class of the remaining pages in the collection. If the pages already downloaded provide sufficient evidence that the guess is right, the remaining pages of the collection are assigned to classes without being fetched. The process iterates until all the link collections are typed with a known class.
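The incremental inference loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy `SITE` dictionary, the `fetch` callback, the use of a plain label as a stand-in for structural similarity, and the `min_evidence` threshold are all assumptions made for illustration, and wrapper generation is left out entirely.

```python
from collections import deque

# Hypothetical toy site: url -> (structural signature, list of link collections).
# Equal signatures stand in for "structurally homogeneous pages" (an assumption).
SITE = {
    "/": ("home", [["/team/1", "/team/2", "/team/3", "/team/4"]]),
    "/team/1": ("team", [["/player/1", "/player/2"]]),
    "/team/2": ("team", [["/player/3", "/player/4"]]),
    "/team/3": ("team", [["/player/5"]]),
    "/team/4": ("team", [["/player/6"]]),
    **{f"/player/{i}": ("player", []) for i in range(1, 7)},
}


def infer_model(entry, fetch, min_evidence=2):
    """Return (classes, fetched): class label -> member urls, and urls downloaded."""
    classes = {}        # class label -> set of member urls
    assigned = {}       # url -> class label
    fetched = []        # urls that were actually downloaded
    frontier = deque()  # link collections not yet typed

    def visit(url):
        # Download the page, record its class, and queue its link collections.
        signature, collections = fetch(url)
        fetched.append(url)
        assigned[url] = signature
        classes.setdefault(signature, set()).add(url)
        frontier.extend(collections)
        return signature

    visit(entry)  # the entry point seeds the first class
    while frontier:
        collection = frontier.popleft()
        seen = []  # classes observed so far within this collection
        for i, url in enumerate(collection):
            seen.append(assigned[url] if url in assigned else visit(url))
            # Assumed evidence rule: enough pages seen, all of the same class.
            if len(seen) >= min_evidence and len(set(seen)) == 1:
                for rest in collection[i + 1:]:
                    if rest not in assigned:
                        # Type the remaining pages without fetching them.
                        assigned[rest] = seen[0]
                        classes[seen[0]].add(rest)
                break
    return classes, fetched


classes, fetched = infer_model("/", SITE.__getitem__)
```

On this toy site the sketch downloads 7 of the 11 pages: `/team/3` and `/team/4` are typed as `team` from the evidence of the first two team pages. Pages typed without being fetched are not expanded here, so their own links stay unexplored; in the system described above, navigating them would be the job of the model and the wrapper library.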

Original language: English
Title of host publication: Proceedings 2004 VLDB Conference
Subtitle of host publication: The 30th International Conference on Very Large Databases (VLDB)
Publisher: Elsevier Korea
Number of pages: 4
ISBN (Electronic): 9780120884698
Publication status: Published - 1 Jan 2004

Bibliographical note

Publisher Copyright:
© 2004 Elsevier Inc. All rights reserved.

ASJC Scopus subject areas

  • General Computer Science

