Capturing and querying fine-grained provenance of preprocessing pipelines in data science

Adriane Chapman, Paolo Missier, Giulia Simonelli, Riccardo Torlone

Research output: Contribution to journal › Article › peer-review

Abstract

Data processing pipelines that are designed to clean, transform and alter data in preparation for learning predictive models have an impact on those models’ accuracy and performance, as well as on other properties, such as model fairness. It is therefore important to provide developers with the means to gain an in-depth understanding of how the pipeline steps affect the data, from the raw input to training sets ready to be used for learning. While other efforts track creation and changes of pipelines of relational operators, in this work we analyze the typical operations of data preparation within a machine learning process, and provide infrastructure for generating very granular provenance records from it, at the level of individual elements within a dataset. Our contributions include: (i) the formal definition of a core set of preprocessing operators, and the definition of provenance patterns for each of them, and (ii) a prototype implementation of an application-level provenance capture library that works alongside Python. We report on provenance processing and storage overhead and scalability experiments, carried out over both real ML benchmark pipelines and over TPC-DI, and show how the resulting provenance can be used to answer a suite of provenance benchmark queries that underpin some of the developers’ debugging questions, as expressed on the Data Science Stack Exchange.
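
As an illustration of what provenance "at the level of individual elements" could look like, here is a minimal Python sketch. It is not the authors' library or their formal provenance patterns: the record schema `ProvRecord`, the operator `impute_missing`, and the query helper `why` are hypothetical names chosen for this example, and the dataset is a plain list of dictionaries rather than a real data frame.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ProvRecord:
    """One element-level provenance record (hypothetical schema)."""
    operation: str   # name of the preprocessing step
    row: int         # row index of the affected element
    column: str      # column name of the affected element
    old_value: Any   # value before the step
    new_value: Any   # value after the step


def impute_missing(rows, column, fill_value, log):
    """Replace None in `column` with `fill_value`, logging each changed cell."""
    result = []
    for i, row in enumerate(rows):
        new_row = dict(row)  # copy so the input rows are left untouched
        if new_row[column] is None:
            new_row[column] = fill_value
            log.append(ProvRecord("impute_missing", i, column, None, fill_value))
        result.append(new_row)
    return result


def why(log, row, column):
    """Return the records describing how cell (row, column) was changed."""
    return [r for r in log if r.row == row and r.column == column]


# Toy input: the second row has a missing "age" value.
raw = [{"age": 31, "income": 40000}, {"age": None, "income": 52000}]
provenance = []
clean = impute_missing(raw, "age", 0, provenance)
history = why(provenance, 1, "age")
```

In this sketch only the imputed cell produces a record, so a debugging question such as "which step changed this value?" reduces to a lookup over the log. A real capture library would also need to handle operators that add, drop, or reorder rows and columns.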

Original language: English
Pages (from-to): 507-520
Number of pages: 14
Journal: Proceedings of the VLDB Endowment
Volume: 14
Issue number: 4
Publication status: Published - 2020

Bibliographical note

Funding Information:
The authors thank Carlos Vladimiro Gonzales for making his research pipelines available for our experiments. This work was partially funded by EPSRC (EP/SO28366/1).

Publisher Copyright:
© VLDB Endowment. All rights reserved.

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • General Computer Science
