Abstract
Automated web scraping is a popular means of acquiring data from the web. Scrapers (or wrappers) are derived from either manually or automatically annotated examples, often resulting in under- or over-segmented data, together with missing or spurious content. Automatic repair and maintenance of the extracted data is thus a necessary complement to automatic wrapper generation. Moreover, the extracted data is often the result of a long-term data acquisition effort, so jointly repairing wrappers together with the generated data reduces future needs for data cleaning. We study the problem of computing joint repairs for XPath-based wrappers and their extracted data. We show that the problem is NP-complete in general but becomes tractable under a few natural assumptions. Even tractable solutions to the problem are still impractical on very large datasets, but we propose an optimal approximation that proves effective across a wide variety of domains and sources. Our approach relies on encoded domain knowledge but requires no per-source supervision. An evaluation spanning more than 100k web pages from 100 different sites across a wide variety of application domains shows that joint repairs increase the quality of wrappers by between 15% and 60%, independently of the wrapper generation system, eliminating all errors in more than 50% of the cases.
Original language | English |
---|---|
Title of host publication | Proceedings of the 32nd IEEE International Conference on Data Engineering (ICDE) |
Publisher | Institute of Electrical and Electronics Engineers (IEEE) |
Number of pages | 12 |
ISBN (Electronic) | 9781509020201 |
Publication status | E-pub ahead of print - 23 Jun 2016 |
Event | 32nd IEEE International Conference on Data Engineering (ICDE), Helsinki, Finland. Duration: 16 May 2016 → 20 May 2016 |
Conference
Conference | 32nd IEEE International Conference on Data Engineering (ICDE) |
---|---|
Country/Territory | Finland |
City | Helsinki |
Period | 16/05/16 → 20/05/16 |