Tracing fine-grained provenance in stream processing systems using a reverse mapping method
Tracing fine-grained provenance in stream processing systems using a reverse mapping method
Applications that require continuous processing of high-volume data streams have grown in prevalence and importance. These kinds of system often process streaming data in real-time or near real-time and provide instantaneous responses in order to support a precise and on time decision. In such systems it is difficult to know exactly how a particular result is generated. However, such information is extremely important for the validation and verification of stream processing results. Therefore, it is crucial that stream processing systems have a mechanism for tracking provenance - the information pertaining to the process that produced result data - at the level of individual stream elements which we refer to as fine-grained provenance tracking for streams. The traceability of stream processing systems allows for users to validate individual stream elements, to verify the computation that took place and to understand the chain of reasoning that was used in the production of a stream processing result. Several recent solutions to provenance tracking in stream processing systems mainly focus on coarse-grained stream provenance in which the level of granularity for capturing provenance information is not detailed enough to address our problem. This thesis proposes a novel fine-grained provenance solution for streams that exploits a reverse mapping method to precisely capture dependency relationships for every individual stream element. It is also designed to support a stream-specific provenance query mechanism, which performs provenance queries dynamically over streams of provenance assertions without requiring the assertions to be stored persistently. The dissertation makes four major contributions to the state of the art. First is a provenance model for streams that allows for the provenance of individual stream elements to be obtained. Second is a provenance query method which utilizes a reverse mapping method - stream ancestor functions - in order to obtain the provenance of a particular stream processing result. The third contribution is a stream-specific provenance query mechanism that enables provenance queries to be computed on-the-fly without requiring provenance assertions to be stored persistently. The fourth contribution is the performance characteristics of our stream provenance solution. It is shown that the storage overhead for provenance collection can be reduced significantly by using our storage reduction technique and the marginal cost of storage consumption is constant based on the number of input stream events. A 4% overhead for the persistent provenance approach and a 7% overhead for the stream-specific query approach are observed as the impact of provenance recording on system performance. In addition, our stream-specific query approach offers low latency processing (0.3 ms per additional component) with reasonable memory consumption.
Sansrimahachai, Watsawee
49e185c7-e55c-4c40-9d97-ca7b861d7e94
April 2012
Sansrimahachai, Watsawee
49e185c7-e55c-4c40-9d97-ca7b861d7e94
Moreau, Luc
033c63dd-3fe9-4040-849f-dfccbe0406f8
Weal, Mark J.
e8fd30a6-c060-41c5-b388-ca52c81032a4
Sansrimahachai, Watsawee
(2012)
Tracing fine-grained provenance in stream processing systems using a reverse mapping method.
University of Southampton, Faculty of Physical and Applied Sciences, Doctoral Thesis, 203pp.
Record type:
Thesis
(Doctoral)
Abstract
Applications that require continuous processing of high-volume data streams have grown in prevalence and importance. These kinds of system often process streaming data in real-time or near real-time and provide instantaneous responses in order to support a precise and on time decision. In such systems it is difficult to know exactly how a particular result is generated. However, such information is extremely important for the validation and verification of stream processing results. Therefore, it is crucial that stream processing systems have a mechanism for tracking provenance - the information pertaining to the process that produced result data - at the level of individual stream elements which we refer to as fine-grained provenance tracking for streams. The traceability of stream processing systems allows for users to validate individual stream elements, to verify the computation that took place and to understand the chain of reasoning that was used in the production of a stream processing result. Several recent solutions to provenance tracking in stream processing systems mainly focus on coarse-grained stream provenance in which the level of granularity for capturing provenance information is not detailed enough to address our problem. This thesis proposes a novel fine-grained provenance solution for streams that exploits a reverse mapping method to precisely capture dependency relationships for every individual stream element. It is also designed to support a stream-specific provenance query mechanism, which performs provenance queries dynamically over streams of provenance assertions without requiring the assertions to be stored persistently. The dissertation makes four major contributions to the state of the art. First is a provenance model for streams that allows for the provenance of individual stream elements to be obtained. Second is a provenance query method which utilizes a reverse mapping method - stream ancestor functions - in order to obtain the provenance of a particular stream processing result. The third contribution is a stream-specific provenance query mechanism that enables provenance queries to be computed on-the-fly without requiring provenance assertions to be stored persistently. The fourth contribution is the performance characteristics of our stream provenance solution. It is shown that the storage overhead for provenance collection can be reduced significantly by using our storage reduction technique and the marginal cost of storage consumption is constant based on the number of input stream events. A 4% overhead for the persistent provenance approach and a 7% overhead for the stream-specific query approach are observed as the impact of provenance recording on system performance. In addition, our stream-specific query approach offers low latency processing (0.3 ms per additional component) with reasonable memory consumption.
Text
WatsaweeThesis_final.pdf
- Other
More information
Published date: April 2012
Organisations:
University of Southampton, Web & Internet Science
Identifiers
Local EPrints ID: 337675
URI: http://eprints.soton.ac.uk/id/eprint/337675
PURE UUID: 11e035ce-8bf5-4158-a4fb-b03ab245dd85
Catalogue record
Date deposited: 27 Jun 2012 10:36
Last modified: 15 Mar 2024 02:46
Export record
Contributors
Author:
Watsawee Sansrimahachai
Thesis advisor:
Luc Moreau
Thesis advisor:
Mark J. Weal
Download statistics
Downloads from ePrints over the past year. Other digital versions may also be available to download e.g. from the publisher's website.
View more statistics