Reducing streaming data storage required for provenance retrieval using Fourier Transform
University of Southampton
Huang, Zheng
24da4cdd-8d95-482c-ac25-f8a8e6310a1f
January 2023
Chapman, Age
721b7321-8904-4be2-9b01-876c430743f1
Huang, Zheng (2023) Reducing streaming data storage required for provenance retrieval using Fourier Transform. University of Southampton, Doctoral Thesis, 135pp.
Record type: Thesis (Doctoral)
Abstract
In this work, we survey existing research on provenance in streaming environments. Although various reduction techniques have been proposed for provenance and stream storage, answering how-provenance (“show me the data and operations that led to the output data”) still requires storing the complete source and intermediate streams. This makes the volume of streaming data required for provenance retrieval unworkably large. We investigate a method for manipulating streams that retains the information needed to answer how-provenance without deciding in advance what to keep and what to discard. Specifically, we examine the Fourier transform (FT) as a tool for encoding portions of the streaming data for provenance retrieval. We use a real-world respiratory streaming use case to highlight the need for provenance information, then build a stream reduction model and test it against that use case. The experiments show that FT can reduce the size of the streaming data: in our demonstration over the second one-minute time window, eligible streams are reduced by a factor of 15.7 for a streaming application that computes the respiration rate and by a factor of 36.6 for a streaming application that finds the best position. However, the utility of the reduced data for provenance retrieval has some limitations. The FT technique does not affect the answerability of queries that require a stream ID but not specific data (such as PQ1 in the use case), queries that require examining data content (such as PQ2), or queries that require the contributing operators (such as PQ3). For queries that must return the figure of a stream (such as PQ4 and PQ5), however, some patterns may be lost.
The post-processing time for respiration sensor data from the second one-minute window (0.873 seconds for the respiration-rate application and 2.816 seconds for the best-position application) is short enough to support more reactive provenance retrieval, which is usually desirable in stream processing. For PQ1, which does not require specific data, there is no significant difference at the 5% significance level between the median query times on unreduced and reduced streaming data. For PQ2, PQ4, and PQ5, which require reconstructing specific contributing data, there is a positive shift in the median query time from the original data to the reduced data at the 5% significance level. The FT technique does not affect the query time for PQ3, because the contributing operators can be retrieved from the table that stores the metadata.
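The general idea of encoding a stream window with the Fourier transform can be illustrated with a minimal sketch. It does not reproduce the thesis's reduction model: the synthetic signal, the 10 Hz sampling rate, the choice to keep the three largest coefficients, and all function names (`dft_half`, `reduce_window`, `reconstruct`) are assumptions made purely for illustration, and the resulting ratio is not comparable to the figures reported above.

```python
import cmath
import math


def dft_half(samples):
    """Return DFT coefficients X[0..n//2] of a real-valued window."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n // 2 + 1)
    ]


def reduce_window(samples, keep):
    """Keep only the `keep` largest-magnitude coefficients, keyed by bin index."""
    coeffs = dft_half(samples)
    top = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)[:keep]
    return {k: coeffs[k] for k in sorted(top)}


def reconstruct(kept, n):
    """Rebuild an approximation of the window from the retained coefficients."""
    out = []
    for t in range(n):
        total = 0.0
        for k, c in kept.items():
            # DC and (for even n) Nyquist bins occur once; every other bin
            # also has a complex-conjugate twin that was not stored.
            weight = 1.0 if k == 0 or 2 * k == n else 2.0
            total += weight * (c * cmath.exp(2j * math.pi * k * t / n)).real
        out.append(total / n)
    return out


# Hypothetical one-minute window: 10 Hz sampling, 15 breaths per minute,
# plus a weaker harmonic and a small fast component standing in for fine detail.
fs, n = 10, 600
window = [
    0.5
    + math.sin(2 * math.pi * 0.25 * t / fs)
    + 0.2 * math.sin(2 * math.pi * 0.5 * t / fs)
    + 0.05 * math.sin(2 * math.pi * 2.0 * t / fs)
    for t in range(n)
]

kept = reduce_window(window, keep=3)
approx = reconstruct(kept, n)

# Count one index plus real and imaginary parts per retained coefficient.
reduction_ratio = n / (3 * len(kept))
max_error = max(abs(a - b) for a, b in zip(window, approx))

# The strongest non-DC bin gives the respiration rate in breaths per minute.
dominant = max((k for k in kept if k != 0), key=lambda k: abs(kept[k]))
breaths_per_minute = dominant * fs / n * 60

print(sorted(kept), reduction_ratio, max_error, breaths_per_minute)
```

In this sketch the dominant frequency, and therefore the respiration rate, survives the reduction, while the small fast component is discarded and shows up as a nonzero reconstruction error. That mirrors, in a simplified way, the trade-off described above between storage reduction and possible pattern loss in queries that return the shape of a stream.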
Text
Zheng Huang_MPhil_WAIS_22_Jan_2023
- Version of Record
Text
PTD Thesis Huang SIGNED
Restricted to Repository staff only
More information
Published date: January 2023
Identifiers
Local EPrints ID: 473995
URI: http://eprints.soton.ac.uk/id/eprint/473995
PURE UUID: 0458ba38-6ec5-4831-8b74-7ad19aff1aed
Catalogue record
Date deposited: 08 Feb 2023 17:32
Last modified: 17 Mar 2024 03:46
Contributors
Author:
Zheng Huang