Human action recognition based on convolutional neural networks and vision transformers
Alomar, Khaled and Cai, Xiaohao (2025) Human action recognition based on convolutional neural networks and vision transformers. University of Southampton, Doctoral Thesis, 197pp.
Record type: Thesis (Doctoral)
Abstract
This thesis seeks to deepen our understanding of the impact of deep-learning techniques on human action recognition (HAR). It addresses the challenges faced in HAR and proposes solutions focused on enhancing feature extraction and optimizing model design. This is accomplished through three distinct yet closely interconnected chapters (i.e., papers): (i) Data Augmentation in Classification and Segmentation: A Survey and New Strategies; (ii) TransNet: A Transfer Learning-Based Network for Human Action Recognition; and (iii) RNNs, CNNs, and Transformers in Human Action Recognition: A Survey and a Hybrid Model.

The second chapter surveys existing data augmentation techniques for computer vision tasks, including segmentation and classification. Data augmentation is a well-established method in computer vision and is especially beneficial for HAR because it enhances feature extraction: it addresses challenges such as limited datasets and class imbalance, yielding more robust feature extraction and reduced overfitting in neural networks. Studies have demonstrated that data augmentation significantly improves the accuracy and generalizability of models in tasks such as image classification and segmentation; these techniques are subsequently applied to HAR in the third chapter.

The third chapter addresses two significant challenges in HAR: feature extraction and the complexity of HAR models. It introduces a straightforward yet versatile and effective end-to-end deep learning architecture, termed TransNet, as a solution to both. Extensive experimental results and comparisons with state-of-the-art models demonstrate the superior performance of TransNet in terms of flexibility, model complexity, transfer learning capability, training speed, and classification accuracy. The chapter also introduces a novel strategy, referred to as TransNet+, that utilizes autoencoders to form the 2D component of TransNet. TransNet+ enhances feature extraction by directing the model to extract the specific features required: it leverages the encoder part of an autoencoder, trained on computer vision tasks such as human semantic segmentation (HSS), to perform HAR. Extensive experimental results and comparisons with leading models further validate the superior performance of both TransNet and TransNet+ in HAR.

The fourth chapter provides a comprehensive review of Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Vision Transformers (ViTs). It examines the progression from traditional methods to the latest advancements in neural network architectures, offering a chronological and extensive analysis of the existing literature on action recognition. The chapter proposes a novel hybrid model that integrates the strengths of CNNs and ViTs, and offers a detailed performance comparison against existing models, highlighting its efficacy in handling complex HAR tasks with improved accuracy and efficiency. It also discusses emerging trends and future directions for HAR technologies.
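To make the second chapter's subject concrete, the following is a minimal sketch of a standard image augmentation pipeline using torchvision. The specific transforms and parameters are illustrative assumptions, not the new strategies proposed in the chapter.

```python
# Minimal sketch of a standard augmentation pipeline (illustrative only;
# the transforms and parameters are assumptions, not the chapter's strategies).
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(112, scale=(0.8, 1.0)),   # random crop + resize
    transforms.RandomHorizontalFlip(),                     # mirror half the samples
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # photometric variation
    transforms.ToTensor(),                                 # PIL image -> float tensor
])

frame = Image.new("RGB", (160, 160))   # stand-in for a video frame
print(augment(frame).shape)            # torch.Size([3, 112, 112])
```

Each pass through such a pipeline yields a different variant of the same frame, which is how augmentation mitigates limited data and class imbalance.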
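The abstract does not spell out TransNet's internals; as a hedged illustration of the kind of end-to-end, transfer-learning-friendly design it describes (a 2D per-frame component followed by a temporal component), here is a PyTorch sketch. The class name, the ResNet-18 backbone, and the 1D temporal convolution are assumptions for illustration, not the thesis architecture.

```python
# Minimal sketch of a TransNet-like 2D + temporal design (an assumption,
# not the thesis implementation).
import torch
import torch.nn as nn
from torchvision.models import resnet18

class TransNetSketch(nn.Module):
    def __init__(self, num_classes: int, feat_dim: int = 512):
        super().__init__()
        backbone = resnet18(weights=None)  # pretrained weights would enable transfer learning
        backbone.fc = nn.Identity()        # expose the 512-d pooled features
        self.backbone2d = backbone
        self.temporal = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = clip.shape                             # (batch, time, C, H, W)
        feats = self.backbone2d(clip.reshape(b * t, c, h, w))  # per-frame 2D features
        feats = feats.reshape(b, t, -1).transpose(1, 2)        # -> (b, feat_dim, t)
        feats = self.temporal(feats).mean(dim=2)               # temporal conv + pooling
        return self.head(feats)

logits = TransNetSketch(num_classes=10)(torch.randn(2, 8, 3, 112, 112))
print(logits.shape)  # torch.Size([2, 10])
```

Swapping the 2D backbone without touching the temporal half is what makes this family of designs flexible and cheap to retrain.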
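TransNet+ is described as forming the 2D component from the encoder of an autoencoder pretrained on tasks such as human semantic segmentation (HSS). A minimal sketch of that reuse pattern follows; the toy encoder/decoder shapes and the freezing step are assumptions for illustration.

```python
# Minimal sketch of the TransNet+ idea: pretrain an autoencoder on a
# segmentation-style task, then reuse its encoder for HAR (shapes are assumptions).
import torch
import torch.nn as nn

encoder = nn.Sequential(  # encoder half of a toy convolutional autoencoder
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
)
decoder = nn.Sequential(  # decoder used only during HSS-style pretraining
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),  # 1-channel human mask
)

# Stage 1 (omitted): train encoder + decoder to predict human masks (HSS).
# Stage 2: keep the encoder as the 2D feature extractor for HAR.
for p in encoder.parameters():
    p.requires_grad = False   # freeze the pretrained features

frames = torch.randn(4, 3, 112, 112)  # a batch of video frames
features = encoder(frames)            # features steered toward human regions
print(features.shape)                 # torch.Size([4, 64, 28, 28])
```

The point of the pattern is control: the pretraining task decides which features the encoder extracts, which is how TransNet+ "directs" feature extraction.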
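The fourth chapter's hybrid model combines CNNs and ViTs; the sketch below shows one generic way such a hybrid can be wired, with CNN feature maps flattened into tokens for a transformer encoder. All layer sizes and the class name are assumptions, and the thesis model may differ substantially.

```python
# Minimal sketch of a generic CNN + transformer hybrid (an assumption,
# not the thesis model).
import torch
import torch.nn as nn

class HybridSketch(nn.Module):
    def __init__(self, num_classes: int, dim: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(  # convolutions supply local features + downsampling
            nn.Conv2d(3, dim, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fmap = self.cnn(x)                        # (b, dim, h', w') local CNN features
        tokens = fmap.flatten(2).transpose(1, 2)  # (b, h'*w', dim) tokens for attention
        return self.head(self.encoder(tokens).mean(dim=1))

print(HybridSketch(num_classes=10)(torch.randn(2, 3, 112, 112)).shape)  # torch.Size([2, 10])
```

This division of labour is the usual motivation for hybrids: convolutions contribute local inductive bias, while self-attention models global relationships among the resulting tokens.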
Text: Khaled_Alomar_PhD_Thesis_A3 - Version of Record
Text: Final-thesis-submission-Examination-Mr-Khaled-Alomar - Restricted to Repository staff only
More information
Published date: 20 March 2025
Identifiers
Local EPrints ID: 499512
URI: http://eprints.soton.ac.uk/id/eprint/499512
PURE UUID: 1ac7b10a-ec8a-41cb-ac7d-935c73f5b244
Catalogue record
Date deposited: 21 Mar 2025 18:08
Last modified: 22 Aug 2025 02:30
Contributors
Author: Khaled Alomar
Author: Xiaohao Cai
Thesis advisor: Xiaohao Cai