The University of Southampton
University of Southampton Institutional Repository

DPDS: assisting data science with data provenance


Chapman, Adriane, Lauro, Luca, Missier, Paolo and Torlone, Riccardo (2022) DPDS: assisting data science with data provenance. Proceedings of the VLDB Endowment, 15 (12), 3614–3617. (doi:10.14778/3554821.3554857).

Record type: Article

Abstract

Successful data-driven science requires a complex combination of data engineering pipelines and data modelling techniques. Robust and defensible results can only be achieved when each pipeline step that cleans, transforms and alters data in preparation for data modelling can be justified, and its effect on the data explained. The DPDS toolkit presented in this paper is designed to make this justification and explanation process an integral part of data science practice, adding value while remaining as unobtrusive as possible to the analyst. Catering to the broad community of Python/pandas data engineers, DPDS implements an observer pattern that captures the fine-grained provenance associated with each individual element of a dataframe across multiple transformation steps. The resulting provenance graph is stored in Neo4j and queried through a UI, with the goal of helping engineers and analysts justify and explain their choice of data operations, from raw data to model training, by highlighting the details of the changes made by each transformation.
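
The idea of observing a pandas pipeline and recording element-level provenance can be illustrated with a minimal sketch. This is not the DPDS API: the class `ProvenanceTracker`, its `observe` method and the record layout are hypothetical names chosen for illustration, and the records are kept in a plain Python list rather than in Neo4j. The sketch only compares cells whose row and column labels exist both before and after a step, so dropped or added rows and columns are not tracked.

```python
import pandas as pd


def _changed(old, new):
    """Return True if a cell value differs, treating two missing values as equal."""
    if pd.isna(old) or pd.isna(new):
        return not (pd.isna(old) and pd.isna(new))
    return old != new


class ProvenanceTracker:
    """Hypothetical observer that logs which dataframe cells each step changed."""

    def __init__(self):
        self.records = []  # one dict per changed cell, in pipeline order

    def observe(self, step_name, func, df):
        """Apply func to a copy of df and record element-level differences."""
        before = df.copy()
        after = func(df.copy())
        # Assumption: only cells present in both frames are compared.
        rows = before.index.intersection(after.index)
        cols = before.columns.intersection(after.columns)
        for r in rows:
            for c in cols:
                old, new = before.at[r, c], after.at[r, c]
                if _changed(old, new):
                    self.records.append(
                        {"step": step_name, "row": r, "column": c,
                         "old": old, "new": new}
                    )
        return after


# Two illustrative cleaning steps on a toy dataframe.
df = pd.DataFrame({"age": [25.0, None, 40.0],
                   "city": ["Rome", "rome", "Leeds"]})
tracker = ProvenanceTracker()
df1 = tracker.observe(
    "impute_age", lambda d: d.fillna({"age": d["age"].mean()}), df)
df2 = tracker.observe(
    "normalise_city", lambda d: d.assign(city=d["city"].str.title()), df1)

for rec in tracker.records:
    print(rec)
```

Each record links a cell to the step that changed it and to its previous value, which is the kind of information a provenance graph would hold as nodes and edges. A real implementation would also need to handle row and column creation or deletion, which this sketch leaves out.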

Text
3554821.3554857 - Version of Record

More information

Published date: 1 August 2022

Identifiers

Local EPrints ID: 477286
URI: http://eprints.soton.ac.uk/id/eprint/477286
ISSN: 2150-8097
PURE UUID: 928629c4-1ac7-4b03-9eb3-c213b9e67d77
ORCID for Adriane Chapman: orcid.org/0000-0002-3814-2587

Catalogue record

Date deposited: 02 Jun 2023 16:34
Last modified: 17 Mar 2024 03:46

Contributors

Author: Adriane Chapman
Author: Luca Lauro
Author: Paolo Missier
Author: Riccardo Torlone
