Tracing fine-grained provenance in stream processing systems
using a reverse mapping method.
University of Southampton, Faculty of Physical and Applied Sciences,
Applications that require continuous processing of high-volume data streams have grown in prevalence and importance. These kinds of system often process streaming data in real-time or near real-time and provide instantaneous responses in order to support a precise and on time decision. In such systems it is difficult to know exactly how a particular result is generated. However, such information is extremely important for the validation and veri�cation of stream processing results. Therefore, it is crucial that stream processing systems have a mechanism for tracking provenance - the information pertaining to the process that produced result data - at the level of individual stream elements which we refer to as fine-grained provenance tracking for streams. The traceability of stream processing systems allows for users to validate individual stream elements, to verify the computation that took place and to understand the chain of reasoning that was used in the production of a stream processing result. Several recent solutions to provenance tracking in stream processing systems mainly focus on coarse-grained stream provenance in which the level of granularity for capturing provenance information is not detailed enough to address our problem. This thesis proposes a novel fine-grained provenance solution for streams that exploits a reverse mapping method to precisely capture dependency relationships for every individual stream element. It is also designed to support a stream-specific provenance query mechanism, which performs provenance queries dynamically over streams of provenance assertions without requiring the assertions to be stored persistently. The dissertation makes four major contributions to the state of the art. First is a provenance model for streams that allows for the provenance of individual stream elements to be obtained. Second is a provenance query method which utilizes a reverse mapping method - stream ancestor functions - in order to obtain the provenance of a particular stream processing result. The third contribution is a stream-specific provenance query mechanism that enables provenance queries to be computed on-the-fly without requiring provenance assertions to be stored persistently. The fourth contribution is the performance characteristics of our stream provenance solution. It is shown that the storage overhead for provenance collection can be reduced significantly by using our storage reduction technique and the marginal cost of storage consumption is constant based on the number of input stream events. A 4% overhead for the persistent provenance approach and a 7% overhead for the stream-specific query approach are observed as the impact of provenance recording on system performance. In addition, our stream-specific query approach offers low latency processing (0.3 ms per additional component) with reasonable memory consumption.
Actions (login required)