Human action recognition based on convolutional neural networks and vision transformers
Alomar, Khaled and Cai, Xiaohao (2025) Human action recognition based on convolutional neural networks and vision transformers. University of Southampton, Doctoral Thesis, 197pp.
Record type: Thesis (Doctoral)
Abstract
This thesis seeks to deepen our understanding of the impact of deep-learning techniques on human action recognition (HAR). It addresses the challenges faced in HAR and proposes solutions focused on enhancing feature extraction and optimizing model design. This is accomplished through three distinct yet closely interconnected chapters (i.e., papers): (i) Data Augmentation in Classification and Segmentation: A Survey and New Strategies; (ii) TransNet: A Transfer Learning-Based Network for Human Action Recognition; and (iii) RNNs, CNNs, and Transformers in Human Action Recognition: A Survey and a Hybrid Model.

The second chapter surveys existing data augmentation techniques for computer vision tasks, including segmentation and classification. Data augmentation is a well-established method in computer vision and is especially beneficial for HAR because it enhances feature extraction: it addresses challenges such as limited datasets and class imbalance, yielding more robust feature extraction and reduced overfitting in neural networks. Studies have demonstrated that data augmentation significantly improves the accuracy and generalizability of models in tasks such as image classification and segmentation; these techniques are subsequently applied to HAR in the third chapter.

The third chapter addresses two significant challenges in HAR: feature extraction and the complexity of HAR models. It introduces a straightforward yet versatile and effective end-to-end deep learning architecture, termed TransNet, as a solution to both. Extensive experimental results and comparisons with state-of-the-art models demonstrate the superior performance of TransNet in terms of flexibility, model complexity, transfer learning capability, training speed, and classification accuracy. The chapter also introduces a novel strategy, referred to as TransNet+, that utilizes autoencoders to form the 2D component of TransNet. TransNet+ enhances feature extraction by directing the model to extract the specific features required: it leverages the encoder part of an autoencoder, trained on computer vision tasks such as human semantic segmentation (HSS), to perform HAR. Extensive experimental results and comparisons with leading models further validate the superior performance of both TransNet and TransNet+ in HAR.

The fourth chapter provides a comprehensive review of Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Vision Transformers (ViTs). It examines the progression from traditional methods to the latest advancements in neural network architectures, offering a chronological and extensive analysis of the existing literature on action recognition. The chapter proposes a novel hybrid model that integrates the strengths of CNNs and ViTs, and offers a detailed performance comparison against existing models, highlighting its efficacy in handling complex HAR tasks with improved accuracy and efficiency. It also discusses emerging trends and future directions for HAR technologies.
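To make the second chapter's subject concrete, the following is a minimal sketch of a standard image augmentation pipeline using torchvision. The specific transforms and parameters are illustrative assumptions, not the new strategies proposed in the chapter.

```python
# Minimal sketch of a standard augmentation pipeline (illustrative only;
# the transforms and parameters are assumptions, not the chapter's strategies).
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(112, scale=(0.8, 1.0)),   # random crop + resize
    transforms.RandomHorizontalFlip(),                     # mirror half the samples
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # photometric variation
    transforms.ToTensor(),                                 # PIL image -> float tensor
])

frame = Image.new("RGB", (160, 160))   # stand-in for a video frame
print(augment(frame).shape)            # torch.Size([3, 112, 112])
```

Each pass through such a pipeline yields a different variant of the same frame, which is how augmentation mitigates limited data and class imbalance.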
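The abstract does not spell out TransNet's internals; as a hedged illustration of the kind of end-to-end, transfer-learning-friendly design it describes (a 2D per-frame component followed by a temporal component), here is a PyTorch sketch. The class name, the ResNet-18 backbone, and the 1D temporal convolution are assumptions for illustration, not the thesis architecture.

```python
# Minimal sketch of a TransNet-like 2D + temporal design (an assumption,
# not the thesis implementation).
import torch
import torch.nn as nn
from torchvision.models import resnet18

class TransNetSketch(nn.Module):
    def __init__(self, num_classes: int, feat_dim: int = 512):
        super().__init__()
        backbone = resnet18(weights=None)  # pretrained weights would enable transfer learning
        backbone.fc = nn.Identity()        # expose the 512-d pooled features
        self.backbone2d = backbone
        self.temporal = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = clip.shape                             # (batch, time, C, H, W)
        feats = self.backbone2d(clip.reshape(b * t, c, h, w))  # per-frame 2D features
        feats = feats.reshape(b, t, -1).transpose(1, 2)        # -> (b, feat_dim, t)
        feats = self.temporal(feats).mean(dim=2)               # temporal conv + pooling
        return self.head(feats)

logits = TransNetSketch(num_classes=10)(torch.randn(2, 8, 3, 112, 112))
print(logits.shape)  # torch.Size([2, 10])
```

Swapping the 2D backbone without touching the temporal half is what makes this family of designs flexible and cheap to retrain.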
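TransNet+ is described as forming the 2D component from the encoder of an autoencoder pretrained on tasks such as human semantic segmentation (HSS). A minimal sketch of that reuse pattern follows; the toy encoder/decoder shapes and the freezing step are assumptions for illustration.

```python
# Minimal sketch of the TransNet+ idea: pretrain an autoencoder on a
# segmentation-style task, then reuse its encoder for HAR (shapes are assumptions).
import torch
import torch.nn as nn

encoder = nn.Sequential(  # encoder half of a toy convolutional autoencoder
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
)
decoder = nn.Sequential(  # decoder used only during HSS-style pretraining
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),  # 1-channel human mask
)

# Stage 1 (omitted): train encoder + decoder to predict human masks (HSS).
# Stage 2: keep the encoder as the 2D feature extractor for HAR.
for p in encoder.parameters():
    p.requires_grad = False   # freeze the pretrained features

frames = torch.randn(4, 3, 112, 112)  # a batch of video frames
features = encoder(frames)            # features steered toward human regions
print(features.shape)                 # torch.Size([4, 64, 28, 28])
```

The point of the pattern is control: the pretraining task decides which features the encoder extracts, which is how TransNet+ "directs" feature extraction.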
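The fourth chapter's hybrid model combines CNNs and ViTs; the sketch below shows one generic way such a hybrid can be wired, with CNN feature maps flattened into tokens for a transformer encoder. All layer sizes and the class name are assumptions, and the thesis model may differ substantially.

```python
# Minimal sketch of a generic CNN + transformer hybrid (an assumption,
# not the thesis model).
import torch
import torch.nn as nn

class HybridSketch(nn.Module):
    def __init__(self, num_classes: int, dim: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(  # convolutions supply local features + downsampling
            nn.Conv2d(3, dim, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fmap = self.cnn(x)                        # (b, dim, h', w') local CNN features
        tokens = fmap.flatten(2).transpose(1, 2)  # (b, h'*w', dim) tokens for attention
        return self.head(self.encoder(tokens).mean(dim=1))

print(HybridSketch(num_classes=10)(torch.randn(2, 3, 112, 112)).shape)  # torch.Size([2, 10])
```

This division of labour is the usual motivation for hybrids: convolutions contribute local inductive bias, while self-attention models global relationships among the resulting tokens.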
Text: Khaled_Alomar_PhD_Thesis_A3 - Version of Record
Text: Final-thesis-submission-Examination-Mr-Khaled-Alomar - Restricted to Repository staff only
More information
Published date: 20 March 2025
Identifiers
Local EPrints ID: 499512
URI: http://eprints.soton.ac.uk/id/eprint/499512
PURE UUID: 1ac7b10a-ec8a-41cb-ac7d-935c73f5b244
Catalogue record
Date deposited: 21 Mar 2025 18:08
Last modified: 22 Aug 2025 02:30
Contributors
Author: Khaled Alomar
Author: Xiaohao Cai
Thesis advisor: Xiaohao Cai