University of Southampton Institutional Repository

A multimodal large language model framework for gesture generation in social robots


Le, Thien Doanh, Nguyen, Tan Viet Tuyen, Le Duy, Tan and Ramchurn, Gopal (2025) A multimodal large language model framework for gesture generation in social robots. BEAR Workshop, The 2025 IEEE International Conference on Robot and Human Interactive Communication, Eindhoven, Netherlands, 25-29 Aug 2025. 9 pp.

Record type: Conference or Workshop Item (Paper)

Abstract

Non-verbal gestures play a crucial role in social robots, enabling them to signal their intentions to users during human–robot interaction (HRI). While recent research in this domain has primarily focused on robot gesture generation, studies on multimodal generation frameworks remain limited; in such frameworks, generated gestures are harmonized with other generated modalities to better convey the robot’s intention to users via a wider range of communication channels. Inspired by recent advancements in multimodal large language models (MLLMs), we propose a novel framework that integrates motion generation models with existing MLLMs to produce high-quality 3D motions without the need for extensive multimodal training. Our framework comprises three key components: a Denoising Diffusion Motion Generation (DDMG) module that maps text descriptions to motion sequences using a diffusion-based approach; a Motion Decoding Alignment (MDA) module that refines motion representations by incorporating signal embeddings generated by an LLM; and a Fusion Module (FM) that integrates motion features trained in previous phases to enhance coherence and realism. We conducted a series of experiments on a publicly available dataset to evaluate the effectiveness of the proposed framework in terms of motion quality, diversity, and semantic alignment. The results suggest that our multimodal approach can serve as a powerful controller for robot gesture generation, offering a more scalable and effective solution, particularly for social HRI.
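To make the three-stage pipeline described in the abstract concrete, the sketch below shows one way the components could fit together in code. It is an illustrative assumption only: the class names mirror the abstract's module names (DDMG, MDA, Fusion Module), but the dimensions, the toy denoising loop, and the cross-attention used for alignment are placeholders chosen for this sketch, not the authors' implementation.

```python
# Minimal sketch of a text-to-motion pipeline in the spirit of the abstract:
# DDMG (diffusion-style motion generation) -> MDA (alignment with LLM signal
# embeddings) -> Fusion Module. All sizes and update rules are illustrative.
import torch
import torch.nn as nn


class DDMG(nn.Module):
    """Denoising Diffusion Motion Generation: text embedding -> motion sequence."""
    def __init__(self, text_dim=512, motion_dim=64, seq_len=60, steps=50):
        super().__init__()
        self.seq_len, self.motion_dim, self.steps = seq_len, motion_dim, steps
        self.denoiser = nn.Sequential(
            nn.Linear(motion_dim + text_dim, 256), nn.ReLU(), nn.Linear(256, motion_dim)
        )

    def forward(self, text_emb):                            # (B, text_dim)
        B = text_emb.size(0)
        x = torch.randn(B, self.seq_len, self.motion_dim)   # start from Gaussian noise
        cond = text_emb.unsqueeze(1).expand(B, self.seq_len, -1)
        for _ in range(self.steps):                         # toy iterative denoising
            x = x - 0.1 * self.denoiser(torch.cat([x, cond], dim=-1))
        return x                                            # (B, seq_len, motion_dim)


class MDA(nn.Module):
    """Motion Decoding Alignment: refine motion features with an LLM signal embedding."""
    def __init__(self, motion_dim=64, llm_dim=512):
        super().__init__()
        self.proj = nn.Linear(llm_dim, motion_dim)
        self.attn = nn.MultiheadAttention(motion_dim, num_heads=4, batch_first=True)

    def forward(self, motion, llm_emb):                     # (B, T, D), (B, llm_dim)
        signal = self.proj(llm_emb).unsqueeze(1)            # (B, 1, D)
        refined, _ = self.attn(motion, signal, signal)      # cross-attend to the LLM signal
        return motion + refined


class FusionModule(nn.Module):
    """Fusion Module: merge motion features from the previous phases."""
    def __init__(self, motion_dim=64):
        super().__init__()
        self.fuse = nn.Linear(2 * motion_dim, motion_dim)

    def forward(self, raw, refined):
        return self.fuse(torch.cat([raw, refined], dim=-1))


if __name__ == "__main__":
    text_emb = torch.randn(2, 512)      # stand-in for an MLLM text embedding
    llm_emb = torch.randn(2, 512)       # stand-in for the LLM signal embedding
    ddmg, mda, fm = DDMG(), MDA(), FusionModule()
    raw = ddmg(text_emb)
    refined = mda(raw, llm_emb)
    motion = fm(raw, refined)
    print(motion.shape)                 # torch.Size([2, 60, 64])
```

In this sketch the fusion step simply concatenates and projects the raw DDMG output and the MDA-refined features, which loosely mirrors the abstract's description of integrating motion features from the previous phases; the paper itself should be consulted for the actual architectures and training procedure.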

Text: BEAR2025-Camera_Ready - Accepted Manuscript (2MB)
Available under License Creative Commons Attribution.

More information

Published date: 25 August 2025
Venue - Dates: BEAR Workshop, The 2025 IEEE International Conference on Robot and Human Interactive Communication, Eindhoven, Netherlands, 2025-08-25 - 2025-08-29

Identifiers

Local EPrints ID: 506590
URI: http://eprints.soton.ac.uk/id/eprint/506590
PURE UUID: d0444c4b-3d33-4e3c-afbf-e296ae6f0828
ORCID for Tan Viet Tuyen Nguyen: orcid.org/0000-0001-8000-6485
ORCID for Gopal Ramchurn: orcid.org/0000-0001-9686-4302

Catalogue record

Date deposited: 11 Nov 2025 17:58
Last modified: 12 Nov 2025 03:06


Contributors

Author: Thien Doanh Le
Author: Tan Viet Tuyen Nguyen
Author: Tan Le Duy
Author: Gopal Ramchurn

