Fine-grained provenance for high-quality data science

Adriane Chapman*, Paolo Missier, Giulia Simonelli, Riccardo Torlone

*Corresponding author for this work

Computer Science

Research output: Contribution to journal › Conference article › peer-review

Abstract

In this work we analyze the typical operations of data preparation within a machine learning process, and provide infrastructure for generating very granular provenance records from it, at the level of individual elements within a dataset. Our contributions include: (i) the formal definition of a core set of preprocessing operators, (ii) the definition of provenance patterns for each of them, and (iii) a prototype implementation of an application-level provenance capture library that works alongside Python.
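As a rough illustration of what element-level provenance for a preprocessing operator could look like, the sketch below records, for each output cell of a mean-imputation step, the input cells it was derived from. This is not the authors' library or API: the `Derivation` record, the `impute_mean` function, and the sample data are all hypothetical names introduced here only to make the idea concrete.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, str]  # (row index, column name); hypothetical cell identifier


@dataclass(frozen=True)
class Derivation:
    """One element-level provenance record (hypothetical schema, not from the paper)."""
    output_cell: Cell
    operator: str
    input_cells: Tuple[Cell, ...]


def impute_mean(rows: List[Dict[str, Optional[float]]], column: str):
    """Fill missing values in `column` with the column mean.

    Returns the transformed rows plus one Derivation per output cell.
    Assumes at least one observed (non-None) value in the column.
    """
    observed = [(i, r[column]) for i, r in enumerate(rows) if r[column] is not None]
    fill = mean(v for _, v in observed)

    out_rows, records = [], []
    for i, r in enumerate(rows):
        new_row = dict(r)
        if r[column] is None:
            # An imputed value depends on every observed value in the column.
            new_row[column] = fill
            sources = tuple((j, column) for j, _ in observed)
        else:
            # An untouched value depends only on itself.
            sources = ((i, column),)
        records.append(Derivation((i, column), "impute_mean", sources))
        out_rows.append(new_row)
    return out_rows, records


rows = [{"age": 30}, {"age": None}, {"age": 50}]
clean, prov = impute_mean(rows, "age")
```

Here `prov[1]` links the imputed cell in row 1 back to the observed cells in rows 0 and 2, which is the kind of per-element dependency the abstract refers to as granular provenance.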

Original language	English
Pages (from-to)	411-418
Journal	CEUR Workshop Proceedings
Volume	2994
Publication status	Published - 9 Sept 2021
Event	29th Italian Symposium on Advanced Database Systems, SEBD 2021 - Pizzo Calabro, Italy; Duration: 5 Sept 2021 → 9 Sept 2021

Bibliographical note

Publisher Copyright:
© 2021 Copyright for this paper by its authors.

ASJC Scopus subject areas

General Computer Science

Access to Document

ChapmanA2021Fine-grained: Final published version, 975 KB. Licence: Creative Commons: Attribution (CC BY)

https://ceur-ws.org/Vol-2994/paper46.pdf (Licence: Creative Commons: Attribution (CC BY))

Cite this

@article{fd50fb23b0dd48dcb7f7ff1a4b693062,
  title = "Fine-grained provenance for high-quality data science",
  abstract = "In this work we analyze the typical operations of data preparation within a machine learning process, and provide infrastructure for generating very granular provenance records from it, at the level of individual elements within a dataset. Our contributions include: (i) the formal definition of a core set of preprocessing operators, (ii) the definition of provenance patterns for each of them, and (iii) a prototype implementation of an application-level provenance capture library that works alongside Python.",
  author = "Adriane Chapman and Paolo Missier and Giulia Simonelli and Riccardo Torlone",
  note = "Publisher Copyright: {\textcopyright} 2021 Copyright for this paper by its authors.; 29th Italian Symposium on Advanced Database Systems, SEBD 2021 ; Conference date: 05-09-2021 Through 09-09-2021",
  year = "2021",
  month = sep,
  day = "9",
  language = "English",
  volume = "2994",
  pages = "411--418",
  journal = "CEUR Workshop Proceedings",
  issn = "1613-0073",
  publisher = "CEUR-WS.org",
}

TY  - JOUR
T1  - Fine-grained provenance for high-quality data science
AU  - Chapman, Adriane
AU  - Missier, Paolo
AU  - Simonelli, Giulia
AU  - Torlone, Riccardo
PY  - 2021/9/9
Y1  - 2021/9/9
N2  - In this work we analyze the typical operations of data preparation within a machine learning process, and provide infrastructure for generating very granular provenance records from it, at the level of individual elements within a dataset. Our contributions include: (i) the formal definition of a core set of preprocessing operators, (ii) the definition of provenance patterns for each of them, and (iii) a prototype implementation of an application-level provenance capture library that works alongside Python.
AB  - In this work we analyze the typical operations of data preparation within a machine learning process, and provide infrastructure for generating very granular provenance records from it, at the level of individual elements within a dataset. Our contributions include: (i) the formal definition of a core set of preprocessing operators, (ii) the definition of provenance patterns for each of them, and (iii) a prototype implementation of an application-level provenance capture library that works alongside Python.
UR  - http://www.scopus.com/inward/record.url?scp=85118775134&partnerID=8YFLogxK
UR  - https://ceur-ws.org/Vol-2994/
M3  - Conference article
AN  - SCOPUS:85118775134
SN  - 1613-0073
VL  - 2994
SP  - 411
EP  - 418
JO  - CEUR Workshop Proceedings
JF  - CEUR Workshop Proceedings
T2  - 29th Italian Symposium on Advanced Database Systems, SEBD 2021
Y2  - 5 September 2021 through 9 September 2021
ER  -
