University of Southampton Institutional Repository

CNNs, RNNs and Transformers in human action recognition: a survey and a hybrid model


Alomar, Khaled, Aysel, Halil Ibrahim and Cai, Xiaohao (2025) CNNs, RNNs and Transformers in human action recognition: a survey and a hybrid model. Artificial Intelligence Review, 58 (12), [387]. (doi:10.1007/s10462-025-11388-3).

Record type: Article

Abstract

Human action recognition (HAR) encompasses the task of monitoring human activities across various domains, including but not limited to medical, educational, entertainment, visual surveillance, video retrieval, and the identification of anomalous activities. Over the past decade, the field of HAR has witnessed substantial progress by leveraging convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to effectively extract and comprehend intricate information, thereby enhancing the overall performance of HAR systems. Recently, the domain of computer vision has witnessed the emergence of Vision Transformers (ViTs) as a potent solution. The efficacy of Transformer architectures has been validated beyond the confines of image analysis, extending their applicability to diverse video-related tasks. Notably, within this landscape, the research community has shown keen interest in HAR, acknowledging its manifold utility and widespread adoption across various domains. However, HAR remains a challenging task due to variations in human motion, occlusions, viewpoint differences, background clutter, and the need for efficient spatio-temporal feature extraction. Additionally, the trade-off between computational efficiency and recognition accuracy remains a significant obstacle, particularly with the adoption of deep learning models requiring extensive training data and resources. This article aims to present an encompassing survey that focuses on CNNs and the evolution from RNNs to ViTs, given their importance in the domain of HAR. By conducting a thorough examination of existing literature and exploring emerging trends, this study undertakes a critical analysis and synthesis of the accumulated knowledge in this field. Additionally, it investigates the ongoing efforts to develop hybrid approaches. Following this direction, this article presents a novel hybrid model that seeks to integrate the inherent strengths of CNNs and ViTs.
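For readers unfamiliar with what a CNN-ViT hybrid for video-based HAR can look like, the following is a minimal, hypothetical PyTorch sketch: a small 2D CNN extracts per-frame spatial features, a Transformer encoder models temporal relations across the resulting frame tokens, and a linear head predicts the action class. All layer sizes, names and the overall layout are illustrative assumptions; this does not reproduce the specific hybrid model proposed in the article.

# Hypothetical CNN + Transformer hybrid for human action recognition (illustrative only).
import torch
import torch.nn as nn


class HybridCNNTransformerHAR(nn.Module):
    def __init__(self, num_classes: int = 101, embed_dim: int = 256,
                 num_heads: int = 4, num_layers: int = 2):
        super().__init__()
        # Small CNN backbone applied independently to each frame (spatial features).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),              # -> (B*T, 64, 1, 1)
        )
        self.proj = nn.Linear(64, embed_dim)      # frame feature -> token embedding
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Transformer encoder over the sequence of frame tokens (temporal modelling).
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, frames, channels, height, width)
        b, t, c, h, w = video.shape
        feats = self.cnn(video.reshape(b * t, c, h, w)).flatten(1)   # (B*T, 64)
        tokens = self.proj(feats).reshape(b, t, -1)                  # (B, T, D)
        cls = self.cls_token.expand(b, -1, -1)                       # (B, 1, D)
        encoded = self.encoder(torch.cat([cls, tokens], dim=1))      # (B, T+1, D)
        return self.head(encoded[:, 0])                              # class logits


if __name__ == "__main__":
    model = HybridCNNTransformerHAR(num_classes=101)
    clip = torch.randn(2, 16, 3, 112, 112)    # 2 clips of 16 RGB frames each
    print(model(clip).shape)                   # torch.Size([2, 101])

In this kind of design the CNN handles local spatial patterns per frame, while the self-attention layers capture longer-range temporal dependencies across frames, which is the general motivation for combining the two families of models discussed in the survey.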

Text: s10462-025-11388-3 - Version of Record (3MB)
Available under License Creative Commons Attribution.

More information

Accepted/In Press date: 31 August 2025
Published date: 17 October 2025
Additional Information: Publisher Copyright: © The Author(s) 2025.
Keywords: Convolutional neural networks, Deep learning, Human action recognition, Recurrent neural networks, Video classification, Vision transformers

Identifiers

Local EPrints ID: 507110
URI: http://eprints.soton.ac.uk/id/eprint/507110
ISSN: 0269-2821
PURE UUID: c022c1a4-7a4a-48f0-9c22-a6e12bffcafa
ORCID for Khaled Alomar: orcid.org/0000-0002-8303-3240
ORCID for Halil Ibrahim Aysel: orcid.org/0000-0002-4981-0827
ORCID for Xiaohao Cai: orcid.org/0000-0003-0924-2834

Catalogue record

Date deposited: 27 Nov 2025 17:35
Last modified: 28 Nov 2025 02:58

Contributors

Author: Khaled Alomar
Author: Halil Ibrahim Aysel
Author: Xiaohao Cai
