The University of Southampton
University of Southampton Institutional Repository

Tracing fine-grained provenance in stream processing systems using a reverse mapping method

Applications that require continuous processing of high-volume data streams have grown in prevalence and importance. Such systems often process streaming data in real time or near real time and respond immediately in order to support precise and timely decisions. In such systems it is difficult to know exactly how a particular result was generated, yet this information is extremely important for the validation and verification of stream processing results. It is therefore crucial that stream processing systems have a mechanism for tracking provenance (the information pertaining to the process that produced result data) at the level of individual stream elements, which we refer to as fine-grained provenance tracking for streams. Traceability allows users to validate individual stream elements, to verify the computation that took place and to understand the chain of reasoning used to produce a stream processing result. Most recent solutions to provenance tracking in stream processing systems focus on coarse-grained stream provenance, in which provenance information is not captured at a level of granularity detailed enough to address this problem. This thesis proposes a novel fine-grained provenance solution for streams that exploits a reverse mapping method to precisely capture dependency relationships for every individual stream element. It is also designed to support a stream-specific provenance query mechanism, which performs provenance queries dynamically over streams of provenance assertions without requiring the assertions to be stored persistently.
The dissertation makes four major contributions to the state of the art. The first is a provenance model for streams that allows the provenance of individual stream elements to be obtained. The second is a provenance query method that uses a reverse mapping method, stream ancestor functions, to obtain the provenance of a particular stream processing result. The third is a stream-specific provenance query mechanism that enables provenance queries to be computed on the fly without requiring provenance assertions to be stored persistently. The fourth is a characterisation of the performance of our stream provenance solution. It is shown that the storage overhead for provenance collection can be reduced significantly by using our storage reduction technique, and that the marginal cost of storage consumption remains constant as the number of input stream events grows. Provenance recording imposes a 4% overhead on system performance for the persistent provenance approach and a 7% overhead for the stream-specific query approach. In addition, our stream-specific query approach offers low-latency processing (0.3 ms per additional component) with reasonable memory consumption.
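The idea of a reverse mapping can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the event-key layout, the two operations (a one-to-one map and a count-based sliding window), the stream names and every function name below are assumptions chosen for illustration. Each operation is paired with an ancestor function that maps an output element back to the input elements it was derived from, and a provenance query walks those functions backwards to the source stream.

```python
from typing import Callable, Dict, List, Tuple

# An event key identifies one stream element: (stream id, sequence number).
EventKey = Tuple[str, int]
# A stream ancestor function maps an output event key to its input event keys.
AncestorFn = Callable[[EventKey], List[EventKey]]


def window_ancestors(input_stream: str, size: int) -> AncestorFn:
    """Ancestor function for a hypothetical count-based sliding window.

    Assumes output element n was computed from input elements
    max(0, n - size + 1) .. n of `input_stream`.
    """
    def ancestors(key: EventKey) -> List[EventKey]:
        _, n = key
        first = max(0, n - size + 1)
        return [(input_stream, i) for i in range(first, n + 1)]
    return ancestors


def map_ancestors(input_stream: str) -> AncestorFn:
    """Ancestor function for a hypothetical one-to-one (map) operation."""
    def ancestors(key: EventKey) -> List[EventKey]:
        _, n = key
        return [(input_stream, n)]
    return ancestors


def trace(key: EventKey, ancestor_fns: Dict[str, AncestorFn]) -> List[EventKey]:
    """Walk ancestor functions backwards until source streams are reached."""
    stream, _ = key
    if stream not in ancestor_fns:  # a source stream: nothing upstream
        return [key]
    sources: List[EventKey] = []
    for parent in ancestor_fns[stream](key):
        for src in trace(parent, ancestor_fns):
            if src not in sources:  # keep each source element once
                sources.append(src)
    return sources


# Illustrative pipeline: raw -> scaled (map) -> avg (window of 3).
ANCESTOR_FNS: Dict[str, AncestorFn] = {
    "scaled": map_ancestors("raw"),
    "avg": window_ancestors("scaled", size=3),
}
```

Under these assumptions, `trace(("avg", 4), ANCESTOR_FNS)` returns the three `raw` elements with sequence numbers 2 to 4. Only the element keys and the per-operation ancestor functions are needed to answer the query, which is the property that makes fine-grained tracing possible without storing every intermediate dependency explicitly.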
Sansrimahachai, Watsawee
Moreau, Luc
Weal, Mark J.

Sansrimahachai, Watsawee (2012) Tracing fine-grained provenance in stream processing systems using a reverse mapping method. University of Southampton, Faculty of Physical and Applied Sciences, Doctoral Thesis, 203pp.

Record type: Thesis (Doctoral)

More information

Published date: April 2012
Organisations: University of Southampton, Web & Internet Science

Identifiers

Local EPrints ID: 337675
URI: http://eprints.soton.ac.uk/id/eprint/337675
PURE UUID: 11e035ce-8bf5-4158-a4fb-b03ab245dd85
ORCID for Luc Moreau: orcid.org/0000-0002-3494-120X
ORCID for Mark J. Weal: orcid.org/0000-0001-6251-8786

Catalogue record

Date deposited: 27 Jun 2012 10:36
Last modified: 15 Mar 2024 02:46

Contributors

Author: Watsawee Sansrimahachai
Thesis advisor: Luc Moreau
Thesis advisor: Mark J. Weal
