Capturing and querying fine-grained provenance of preprocessing pipelines in data science
Data processing pipelines that are designed to clean, transform and alter data in preparation for learning predictive models have an impact on those models’ accuracy and performance, as well as on other properties, such as model fairness. It is therefore important to provide developers with the means to gain an in-depth understanding of how the pipeline steps affect the data, from the raw input to training sets ready to be used for learning. While other efforts track creation and changes of pipelines of relational operators, in this work we analyze the typical operations of data preparation within a machine learning process, and provide infrastructure for generating very granular provenance records from it, at the level of individual elements within a dataset. Our contributions include: (i) the formal definition of a core set of preprocessing operators, and the definition of provenance patterns for each of them, and (ii) a prototype implementation of an application-level provenance capture library that works alongside Python. We report on provenance processing and storage overhead and scalability experiments, carried out over both real ML benchmark pipelines and over TPC-DI, and show how the resulting provenance can be used to answer a suite of provenance benchmark queries that underpin some of the developers’ debugging questions, as expressed on the Data Science Stack Exchange.
507-520
Chapman, Adriane
721b7321-8904-4be2-9b01-876c430743f1
Missier, Paolo
784fec92-c5a0-466e-92be-1f9cc43b697f
Simonelli, Giulia
9c31c5ce-bbc1-43ea-a352-045698323ee5
Torlone, Riccardo
27acf5ad-ff5a-4266-8c26-aa33452891bb
December 2020
Chapman, Adriane, Missier, Paolo, Simonelli, Giulia and Torlone, Riccardo (2020) Capturing and querying fine-grained provenance of preprocessing pipelines in data science. Proceedings of the VLDB Endowment, 14 (4), 507-520. (doi:10.14778/3436905.3436911).
Text: ChapmanVLDB2021 - Accepted Manuscript
More information
Published date: December 2020
Additional Information:
Funding Information:
The authors thank Carlos Vladimiro Gonzales for making his research pipelines available for our experiments. This work was partially funded by EPSRC (EP/S028366/1).
Publisher Copyright:
© VLDB Endowment. All rights reserved.
Copyright:
Copyright 2021 Elsevier B.V., All rights reserved.
Identifiers
Local EPrints ID: 449939
URI: http://eprints.soton.ac.uk/id/eprint/449939
ISSN: 2150-8097
PURE UUID: 6dde30eb-be71-4653-97e2-5a428802c332
Catalogue record
Date deposited: 28 Jun 2021 16:32
Last modified: 06 Jun 2024 01:59
Contributors
Author: Adriane Chapman
Author: Paolo Missier
Author: Giulia Simonelli
Author: Riccardo Torlone