Joint Repairs for Web Wrappers

Stefano Ortona, Giorgio Orsi, Tim Furche, Marcello Buoncristiano

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

8 Citations (Scopus)

Abstract

Automated web scraping is a popular means for acquiring data from the web. Scrapers (or wrappers) are derived from either manually or automatically annotated examples, often resulting in under- or over-segmented data, together with missing or spurious content. Automatic repair and maintenance of the extracted data is thus a necessary complement to automatic wrapper generation. Moreover, the extracted data is often the result of a long-term data acquisition effort, so jointly repairing wrappers together with the generated data reduces future needs for data cleaning. We study the problem of computing joint repairs for XPath-based wrappers and their extracted data. We show that the problem is NP-complete in general but becomes tractable under a few natural assumptions. Even tractable solutions to the problem are still impractical on very large datasets, but we propose an optimal approximation that proves effective across a wide variety of domains and sources. Our approach relies on encoded domain knowledge but requires no per-source supervision. An evaluation spanning more than 100k web pages from 100 different sites across a wide variety of application domains shows that joint repairs increase the quality of wrappers by between 15% and 60%, independently of the wrapper generation system, eliminating all errors in more than 50% of the cases.
Original language: English
Title of host publication: Proceedings of the 32nd IEEE International Conference on Data Engineering (ICDE)
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Number of pages: 12
ISBN (Electronic): 9781509020201
Publication status: E-pub ahead of print - 23 Jun 2016
Event: 32nd IEEE International Conference on Data Engineering (ICDE) - Helsinki, Finland
Duration: 16 May 2016 to 20 May 2016

Conference

Conference: 32nd IEEE International Conference on Data Engineering (ICDE)
Country/Territory: Finland
City: Helsinki
Period: 16/05/16 to 20/05/16
