Efficient video recognition with convolutional neural networks by exploiting temporal correlation in video data
Sabetsarvestani, Mohammadamin (2022) Efficient video recognition with convolutional neural networks by exploiting temporal correlation in video data. University of Southampton, Doctoral Thesis, 164pp.
Record type: Thesis (Doctoral)
Abstract
Object detection in images is one of the most successful applications of convolutional neural networks (CNNs). However, applying deep CNNs to large numbers of video frames has recently emerged as a new challenge beyond image data due to the high computational requirements. Because consecutive video frames have similar appearances, CNNs often extract similar features from them. Conventional video object detection pipelines extract features of individual frames with a fixed computational effort, resulting in numerous redundant computations and an inefficient use of energy resources, particularly for edge computing. By exploiting frame-to-frame similarity, this thesis shows that the computational complexity of video object detection pipelines can be reduced. Similarity-aware CNNs are proposed to identify and avoid computations on similar feature pixels across frames. First, the proposed similarity-aware quantization scheme (SQS) increases the average number of unchanged feature pixels across frame pairs by up to 85% with a loss of less than 1% in detection accuracy. Second, by minimising redundant computations and memory accesses across frame pairs, the proposed similarity-aware row stationary (SRS) dataflow reduces energy consumption. According to simulation experiments, the proposed dataflow reduces video frame processing energy consumption by up to 30%. Third, a new temporal early exit module (TEEM) is proposed to further improve the efficiency of video object detection. TEEM detects semantic differences between consecutive frames with low computation overhead, avoiding redundant video frame feature extraction. Multiple TEEMs are inserted into the pipelines' feature network at various early layers. TEEM-enabled pipelines only require full computation effort when a frame is determined to be semantically distinct from previous frames; otherwise, previous frame detection results are reused. Experiments on ImageNet VID and TVnet demonstrate that TEEMs accelerate state-of-the-art video object detection pipelines by 1.7× while keeping the reduction in mean average precision below 1%.
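To make the early-exit idea concrete, the sketch below shows one possible way a TEEM-style gate could be wrapped around a detector: cheap early-layer features of the current frame are compared against those of the last fully processed frame, and the expensive detection head runs only when the measured change exceeds a threshold. This is a minimal illustrative sketch, not the thesis's actual implementation; the names (FrameGate, early_features, full_detection, diff_threshold) and the mean-absolute-difference criterion are assumptions.

# Illustrative sketch only: a temporal early-exit gate in the spirit of TEEM.
import numpy as np

class FrameGate:
    def __init__(self, early_features, full_detection, diff_threshold=0.05):
        self.early_features = early_features   # cheap early-layer feature extractor (callable)
        self.full_detection = full_detection   # full, expensive detection pipeline (callable)
        self.diff_threshold = diff_threshold   # assumed semantic-change threshold
        self._ref_features = None              # features of the last fully processed frame
        self._ref_detections = None            # detections reused for similar frames

    def detect(self, frame):
        feats = self.early_features(frame)
        if self._ref_features is not None:
            # Mean absolute difference of early features as a cheap change measure.
            change = float(np.mean(np.abs(feats - self._ref_features)))
            if change < self.diff_threshold:
                return self._ref_detections    # early exit: reuse previous results
        # Frame is semantically distinct: run the full pipeline and update the references.
        self._ref_features = feats
        self._ref_detections = self.full_detection(frame)
        return self._ref_detections

In a TEEM-enabled pipeline such a gate would sit after an early convolutional stage, so the cost of deciding whether to skip remains far below the cost of full feature extraction.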
Text: Final Thesis for award - Version of Record (Restricted to Repository staff only)
More information
Published date: December 2022
Identifiers
Local EPrints ID: 473997
URI: http://eprints.soton.ac.uk/id/eprint/473997
PURE UUID: 402862d1-dee5-4971-a974-11a086f494e3
Catalogue record
Date deposited: 08 Feb 2023 17:43
Last modified: 17 Mar 2024 03:03
Contributors
Author: Mohammadamin Sabetsarvestani
Thesis advisor: Geoffrey Merrett