Abstract
This chapter investigates a system that automatically extracts data from data-intensive Web sites. The system first infers a model that describes the Web site as a collection of classes. Each class represents a set of structurally homogeneous pages and is associated with a small set of representative pages. Based on the model, a library of wrappers, one per class, is then inferred with the help of an external wrapper generator. The model, together with the library of wrappers, can thus be used to navigate the site and extract the data. The inference process is performed incrementally. The system starts from a given entry point, which becomes the first member of the first class in the model. It then refines the model by exploring its boundaries to gather new pages. At each iteration, the system selects an outbound link collection of the model and iteratively fetches pages by following the links in the collection. To reduce the number of pages actually visited, after each download the system makes a guess about the class of the remaining pages. If the pages already downloaded provide sufficient evidence that the guess is correct, the remaining pages of the collection are assigned to classes without actually being fetched. The process iterates until all the link collections are typed with a known class.
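To make the iterative process concrete, the following is a minimal sketch of such an incremental inference loop. It is not the paper's implementation: the page representation (a dict with a tag set and outbound link collections), the `fetch` callback, the `structurally_similar` test, and the `max_samples` threshold are all illustrative assumptions, and the class-guessing rule here is a deliberately simple "unanimous sample" heuristic.

```python
from collections import deque

def structurally_similar(page_a, page_b):
    """Stand-in structural test: treat pages with the same tag set as
    homogeneous. The actual system uses a far finer page-structure
    comparison; this is only illustrative."""
    return set(page_a["tags"]) == set(page_b["tags"])

def assign_class(classes, page):
    """Add `page` to the first class whose representative it matches,
    or open a new class for it."""
    for members in classes:
        if structurally_similar(members[0], page):
            members.append(page)
            return
    classes.append([page])

def infer_site_model(entry_url, fetch, max_samples=3):
    """Incremental model inference, roughly as the abstract describes.

    `fetch(url)` must return a page as a dict with keys "tags" (its
    tag names) and "link_collections" (a list of outbound URL lists).
    """
    entry_page = fetch(entry_url)
    classes = [[entry_page]]          # the entry point seeds the first class
    frontier = deque(entry_page["link_collections"])
    seen = {entry_url}

    while frontier:                   # untyped link collections remain
        sample = []
        for url in frontier.popleft():    # walk one link collection
            if url in seen:
                continue
            seen.add(url)
            page = fetch(url)
            sample.append(page)
            assign_class(classes, page)
            frontier.extend(page["link_collections"])
            # Guess that the rest of the collection shares this class;
            # accept the guess (and skip the remaining fetches) once the
            # downloaded sample is unanimous.
            if len(sample) >= max_samples and all(
                structurally_similar(sample[0], p) for p in sample[1:]
            ):
                break
    return classes
```

In a real setting, `fetch` would wrap an HTTP client plus an HTML parser, and once the classes are fixed, a wrapper would be inferred for each class via the external wrapper generator, as the abstract describes.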
| Original language | English |
| --- | --- |
| Title of host publication | Proceedings 2004 VLDB Conference |
| Subtitle of host publication | The 30th International Conference on Very Large Databases (VLDB) |
| Publisher | Elsevier Korea |
| Pages | 1321-1324 |
| Number of pages | 4 |
| ISBN (Electronic) | 9780120884698 |
| Publication status | Published - 1 Jan 2004 |
Bibliographical note
Publisher Copyright: © 2004 Elsevier Inc. All rights reserved.
ASJC Scopus subject areas
- General Computer Science