A multimodal large language model framework for gesture generation in social robots
Le, Thien Doanh
Nguyen, Tan Viet Tuyen
Le Duy, Tan
Ramchurn, Gopal
25 August 2025
Le, Thien Doanh, Nguyen, Tan Viet Tuyen, Le Duy, Tan and Ramchurn, Gopal (2025) A multimodal large language model framework for gesture generation in social robots. BEAR Workshop, The 2025 IEEE International Conference on Robot and Human Interactive Communication, Eindhoven, Netherlands, 25-29 Aug 2025. 9 pp.
Record type: Conference or Workshop Item (Paper)
Abstract
Non-verbal gestures play a crucial role in social robots, enabling them to signal their intentions to users during human–robot interaction (HRI). While recent research in this domain has primarily focused on robot gesture generation, studies on multimodal generation frameworks, in which generated gestures are harmonized with other generated modalities to better convey the robot's intention to users across a wider range of communication channels, remain limited. Inspired by recent advances in multimodal large language models (MLLMs), we propose a novel framework that integrates motion generation models with existing MLLMs to produce high-quality 3D motions without the need for extensive multimodal training. Our framework comprises three key components: a Denoising Diffusion Motion Generation (DDMG) module that maps text descriptions to motion sequences using a diffusion-based approach; a Motion Decoding Alignment (MDA) module that refines motion representations by incorporating signal embeddings generated by an LLM; and a Fusion Module (FM) that integrates the motion features learned in the earlier phases to enhance coherence and realism. We conducted a series of experiments on a publicly available dataset to evaluate the effectiveness of the proposed framework in terms of motion quality, diversity, and semantic alignment. The results suggest that our multimodal approach can serve as a powerful controller for robot gesture generation, offering a more scalable and effective solution, particularly for social HRI.
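For orientation, the three-module pipeline named in the abstract can be pictured roughly as below. This is a minimal PyTorch sketch under heavy simplification: the class names DDMG, MDA and FusionModule mirror the abstract's terminology, but every interface, dimension, the placeholder embeddings, and the single unscheduled denoising loop are illustrative assumptions, not the authors' implementation.

    # Hypothetical sketch of the pipeline in the abstract; all interfaces
    # are assumptions, not the authors' API.
    import torch
    import torch.nn as nn

    class DDMG(nn.Module):
        """Denoising Diffusion Motion Generation (sketch): maps a text
        embedding to a motion sequence by iteratively refining noise. A real
        diffusion model would use a noise schedule and timestep inputs."""
        def __init__(self, motion_dim: int, cond_dim: int, steps: int = 50):
            super().__init__()
            self.motion_dim, self.steps = motion_dim, steps
            self.denoiser = nn.Sequential(
                nn.Linear(motion_dim + cond_dim, 512),
                nn.SiLU(),
                nn.Linear(512, motion_dim),
            )
        def forward(self, text_emb: torch.Tensor, seq_len: int) -> torch.Tensor:
            x = torch.randn(seq_len, self.motion_dim)   # start from noise
            cond = text_emb.expand(seq_len, -1)         # broadcast condition
            for _ in range(self.steps):                 # crude denoising loop
                x = x - 0.1 * self.denoiser(torch.cat([x, cond], dim=-1))
            return x

    class MDA(nn.Module):
        """Motion Decoding Alignment (sketch): refines motion features with a
        signal embedding from an LLM (here just a placeholder tensor)."""
        def __init__(self, motion_dim: int, llm_dim: int):
            super().__init__()
            self.align = nn.Linear(motion_dim + llm_dim, motion_dim)
        def forward(self, motion: torch.Tensor, llm_signal: torch.Tensor) -> torch.Tensor:
            signal = llm_signal.expand(motion.shape[0], -1)
            return self.align(torch.cat([motion, signal], dim=-1))

    class FusionModule(nn.Module):
        """Fusion Module (sketch): merges raw and aligned motion features."""
        def __init__(self, motion_dim: int):
            super().__init__()
            self.fuse = nn.Linear(2 * motion_dim, motion_dim)
        def forward(self, raw: torch.Tensor, aligned: torch.Tensor) -> torch.Tensor:
            return self.fuse(torch.cat([raw, aligned], dim=-1))

    # Illustrative usage with stand-in embeddings (no real text encoder or LLM).
    text_emb = torch.randn(256)                 # pretend text-encoder output
    llm_signal = torch.randn(128)               # pretend LLM signal embedding
    ddmg, mda, fm = DDMG(64, 256), MDA(64, 128), FusionModule(64)
    raw = ddmg(text_emb, seq_len=120)           # (120, 64) motion sequence
    motion = fm(raw, mda(raw, llm_signal))      # fused (120, 64) output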
Text: BEAR2025-Camera_Ready - Accepted Manuscript
More information
Published date: 25 August 2025
Venue - Dates: BEAR Workshop, The 2025 IEEE International Conference on Robot and Human Interactive Communication, Eindhoven, Netherlands, 2025-08-25 - 2025-08-29
Identifiers
Local EPrints ID: 506590
URI: http://eprints.soton.ac.uk/id/eprint/506590
PURE UUID: d0444c4b-3d33-4e3c-afbf-e296ae6f0828
Catalogue record
Date deposited: 11 Nov 2025 17:58
Last modified: 12 Nov 2025 03:06
Contributors
Author: Thien Doanh Le
Author: Tan Viet Tuyen Nguyen
Author: Tan Le Duy
Author: Gopal Ramchurn