The University of Southampton
University of Southampton Institutional Repository

Capturing and querying fine-grained provenance of preprocessing pipelines in data science



Chapman, Adriane, Missier, Paolo, Simonelli, Giulia and Torlone, Riccardo (2020) Capturing and querying fine-grained provenance of preprocessing pipelines in data science. Proceedings of the VLDB Endowment, 14 (4), 507-520. (doi:10.14778/3436905.3436911).

Record type: Article

Abstract

Data processing pipelines that are designed to clean, transform and alter data in preparation for learning predictive models have an impact on those models’ accuracy and performance, as well as on other properties, such as model fairness. It is therefore important to provide developers with the means to gain an in-depth understanding of how the pipeline steps affect the data, from the raw input to training sets ready to be used for learning. While other efforts track creation and changes of pipelines of relational operators, in this work we analyze the typical operations of data preparation within a machine learning process, and provide infrastructure for generating very granular provenance records from it, at the level of individual elements within a dataset. Our contributions include: (i) the formal definition of a core set of preprocessing operators, and the definition of provenance patterns for each of them, and (ii) a prototype implementation of an application-level provenance capture library that works alongside Python. We report on provenance processing and storage overhead and scalability experiments, carried out over both real ML benchmark pipelines and over TPC-DI, and show how the resulting provenance can be used to answer a suite of provenance benchmark queries that underpin some of the developers’ debugging questions, as expressed on the Data Science Stack Exchange.
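
To make the idea of element-level provenance concrete, the following is a minimal illustrative sketch, not the authors' library or its API. It assumes a dataset held as a list of dictionaries and two hypothetical operators (impute_mean, min_max_scale); the ProvenanceLog class, its derive and history methods, and the record fields are all invented for this example. Each operator logs one record per changed cell, so a developer can later ask which steps produced a given value.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class ProvenanceLog:
    """Hypothetical element-level provenance store (not the paper's library)."""
    records: list = field(default_factory=list)

    def derive(self, step, row, column, old, new):
        # One record per changed cell: which step turned `old` into `new`.
        self.records.append(
            {"step": step, "row": row, "column": column, "old": old, "new": new}
        )

    def history(self, row, column):
        # Query: every recorded change to one cell, in pipeline order.
        return [r for r in self.records
                if r["row"] == row and r["column"] == column]


def impute_mean(rows, column, log):
    """Replace missing values in `column` with the mean of the present ones."""
    present = [r[column] for r in rows if r[column] is not None]
    fill = mean(present)
    out = []
    for i, r in enumerate(rows):
        new_row = dict(r)
        if r[column] is None:
            new_row[column] = fill
            log.derive("impute_mean", i, column, None, fill)
        out.append(new_row)
    return out


def min_max_scale(rows, column, log):
    """Rescale `column` to [0, 1], logging each cell whose value changes."""
    values = [r[column] for r in rows]
    lo, hi = min(values), max(values)
    out = []
    for i, r in enumerate(rows):
        new_row = dict(r)
        scaled = (r[column] - lo) / (hi - lo) if hi != lo else 0.0
        if scaled != r[column]:
            log.derive("min_max_scale", i, column, r[column], scaled)
        new_row[column] = scaled
        out.append(new_row)
    return out


# Toy pipeline: impute the missing age, then scale the column.
log = ProvenanceLog()
raw = [{"age": 20}, {"age": None}, {"age": 40}]
clean = min_max_scale(impute_mean(raw, "age", log), "age", log)
steps = [r["step"] for r in log.history(1, "age")]
```

Here log.history(1, "age") returns the imputation record followed by the scaling record for the second row, which is the kind of "why does this cell have this value?" question the abstract refers to. A real capture library would also need stable identifiers for rows that are dropped or reordered, which this sketch leaves out.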

Text
ChapmanVLDB2021 - Accepted Manuscript

More information

Published date: December 2020
Additional Information: Funding Information: The authors thank Carlos Vladimiro Gonzales for making his research pipelines available for our experiments. This work was partially funded by EPSRC (EP/S028366/1). Publisher Copyright: © VLDB Endowment. All rights reserved. Copyright: Copyright 2021 Elsevier B.V., All rights reserved.

Identifiers

Local EPrints ID: 449939
URI: http://eprints.soton.ac.uk/id/eprint/449939
ISSN: 2150-8097
PURE UUID: 6dde30eb-be71-4653-97e2-5a428802c332
ORCID for Adriane Chapman: orcid.org/0000-0002-3814-2587

Catalogue record

Date deposited: 28 Jun 2021 16:32
Last modified: 29 Jun 2021 01:53

Contributors

Author: Adriane Chapman
Author: Paolo Missier
Author: Giulia Simonelli
Author: Riccardo Torlone
