University of Southampton Institutional Repository

RNNs, CNNs and transformers in human action recognition: a survey and a hybrid model

arXiv
Alomar, Khaled
ff1cdb20-40a5-42e3-82db-935881354868
Aysel, Halil Ibrahim
9db69eca-47c7-4443-86a1-33504e172d60
Cai, Xiaohao
de483445-45e9-4b21-a4e8-b0427fc72cee


Record type: UNSPECIFIED

Abstract

Human Action Recognition (HAR) is the task of monitoring and classifying human activities, with applications across various domains including medical, educational, entertainment, visual surveillance, video retrieval, and anomalous activity detection. Over the past decade, HAR has made substantial progress by leveraging Convolutional Neural Networks (CNNs) to extract and interpret intricate visual information, enhancing the overall performance of HAR systems. More recently, Vision Transformers (ViTs) have emerged in computer vision as a potent alternative, and the efficacy of transformer architectures has been validated beyond image analysis, extending their applicability to diverse video-related tasks. Within this landscape, the research community has shown keen interest in HAR, given its utility and widespread adoption across domains. This article presents a survey of CNNs and of the evolution from Recurrent Neural Networks (RNNs) to ViTs, given their importance to HAR. Through a thorough examination of the existing literature and emerging trends, it critically analyses and synthesises the accumulated knowledge in the field, and it reviews ongoing efforts to develop hybrid approaches. Following this direction, the article also presents a novel hybrid model that integrates the inherent strengths of CNNs and ViTs.

Text: 2407.06162v2 (Author's Original)
Available under a Creative Commons Attribution license.

More information

Published date: 2 June 2024
Keywords: cs.CV, cs.AI, cs.LG

Identifiers

Local EPrints ID: 498020
URI: http://eprints.soton.ac.uk/id/eprint/498020
PURE UUID: 117ea2a4-fb4e-4aa5-803d-9166120ec9d6
ORCID for Khaled Alomar: orcid.org/0000-0002-8303-3240
ORCID for Halil Ibrahim Aysel: orcid.org/0000-0002-4981-0827
ORCID for Xiaohao Cai: orcid.org/0000-0003-0924-2834

Catalogue record

Date deposited: 06 Feb 2025 17:31
Last modified: 07 Feb 2025 03:04



Contributors

Author: Khaled Alomar
Author: Halil Ibrahim Aysel
Author: Xiaohao Cai



Contact ePrints Soton: eprints@soton.ac.uk

ePrints Soton supports OAI 2.0 with a base URL of http://eprints.soton.ac.uk/cgi/oai2

This repository has been built using EPrints software, developed at the University of Southampton, but available to everyone to use.
