Efficient video recognition with convolutional neural networks by exploiting temporal correlation in video data
Sabetsarvestani, Mohammadamin (2022) Efficient video recognition with convolutional neural networks by exploiting temporal correlation in video data. University of Southampton, Doctoral Thesis, 164pp.
Record type: Thesis (Doctoral)
Abstract
Object detection in images is one of the most successful applications of convolutional neural networks (CNNs). However, applying deep CNNs to large numbers of video frames has recently emerged as a new challenge beyond image data due to the high computational requirements. Because consecutive video frames have similar appearances, CNNs often extract similar features from them. Conventional video object detection pipelines extract features of individual frames with a fixed computational effort, resulting in numerous redundant computations and an inefficient use of energy resources, particularly for edge computing. By exploiting frame-to-frame similarity, this thesis shows that the computational complexity of video object detection pipelines can be reduced. Similarity-aware CNNs are proposed to identify and avoid computations on similar feature pixels across frames. First, the proposed similarity-aware quantization scheme (SQS) increases the average number of unchanged feature pixels across frame pairs by up to 85% with a loss of less than 1% in detection accuracy. Second, by minimising redundant computations and memory accesses across frame pairs, the proposed similarity-aware row stationary (SRS) dataflow reduces energy consumption. According to simulation experiments, the proposed dataflow reduces video frame processing energy consumption by up to 30%. Third, a new temporal early exit module (TEEM) is proposed to further improve the efficiency of video object detection. TEEM detects semantic differences between consecutive frames with low computation overhead, avoiding redundant video frame feature extraction. Multiple TEEMs are inserted into the pipelines' feature network at various early layers. TEEM-enabled pipelines only require full computation effort when a frame is determined to be semantically distinct from previous frames; otherwise, previous frame detection results are reused. Experiments on ImageNet VID and TVnet demonstrate that TEEMs accelerate state-of-the-art video object detection pipelines by 1.7× while keeping the reduction in mean average precision below 1%.
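To make the early-exit idea concrete, the sketch below shows one possible way a TEEM-style gate could be wrapped around a detector: cheap early-layer features of the current frame are compared against those of the last fully processed frame, and the expensive detection head runs only when the measured change exceeds a threshold. This is a minimal illustrative sketch, not the thesis's actual implementation; the names (FrameGate, early_features, full_detection, diff_threshold) and the mean-absolute-difference criterion are assumptions.

# Illustrative sketch only: a temporal early-exit gate in the spirit of TEEM.
import numpy as np

class FrameGate:
    def __init__(self, early_features, full_detection, diff_threshold=0.05):
        self.early_features = early_features   # cheap early-layer feature extractor (callable)
        self.full_detection = full_detection   # full, expensive detection pipeline (callable)
        self.diff_threshold = diff_threshold   # assumed semantic-change threshold
        self._ref_features = None              # features of the last fully processed frame
        self._ref_detections = None            # detections reused for similar frames

    def detect(self, frame):
        feats = self.early_features(frame)
        if self._ref_features is not None:
            # Mean absolute difference of early features as a cheap change measure.
            change = float(np.mean(np.abs(feats - self._ref_features)))
            if change < self.diff_threshold:
                return self._ref_detections    # early exit: reuse previous results
        # Frame is semantically distinct: run the full pipeline and update the references.
        self._ref_features = feats
        self._ref_detections = self.full_detection(frame)
        return self._ref_detections

In a TEEM-enabled pipeline such a gate would sit after an early convolutional stage, so the cost of deciding whether to skip remains far below the cost of full feature extraction.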
Text: Final Thesis for award - Version of Record (Restricted to Repository staff only)
More information
Published date: December 2022
Identifiers
Local EPrints ID: 473997
URI: http://eprints.soton.ac.uk/id/eprint/473997
PURE UUID: 402862d1-dee5-4971-a974-11a086f494e3
Catalogue record
Date deposited: 08 Feb 2023 17:43
Last modified: 17 Mar 2024 03:03
Contributors
Author: Mohammadamin Sabetsarvestani
Thesis advisor: Geoffrey Merrett