University of Southampton Institutional Repository

A multimodal large language model framework for gesture generation in social robots


Le, Thien Doanh, Nguyen, Tan Viet Tuyen, Le Duy, Tan and Ramchurn, Gopal (2025) A multimodal large language model framework for gesture generation in social robots. BEAR Workshop, The 2025 IEEE International Conference on Robot and Human Interactive Communication, Eindhoven, Netherlands, 25-29 Aug 2025. 9 pp.

Record type: Conference or Workshop Item (Paper)

Abstract

Non-verbal gestures play a crucial role in social robots, enabling them to signal their intentions to users during human–robot interaction (HRI). While recent research in this domain has primarily focused on robot gesture generation, studies on multimodal generation frameworks remain limited; in such frameworks, generated gestures are harmonized with other generated modalities to better convey the robot’s intention to users via a wider range of communication channels. Inspired by recent advancements in multimodal large language models (MLLMs), we propose a novel framework that integrates motion generation models with existing MLLMs to produce high-quality 3D motions without the need for extensive multimodal training. Our framework comprises three key components: a Denoising Diffusion Motion Generation (DDMG) module that maps text descriptions to motion sequences using a diffusion-based approach; a Motion Decoding Alignment (MDA) module that refines motion representations by incorporating signal embeddings generated by an LLM; and a Fusion Module (FM) that integrates motion features trained in previous phases to enhance coherence and realism. We conducted a series of experiments on a publicly available dataset to evaluate the effectiveness of the proposed framework in terms of motion quality, diversity, and semantic alignment. The results suggest that our multimodal approach can serve as a powerful controller for robot gesture generation, offering a more scalable and effective solution, particularly for social HRI.
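To make the three-stage pipeline described in the abstract concrete, the sketch below shows one way the components could fit together in code. It is an illustrative assumption only: the class names mirror the abstract's module names (DDMG, MDA, Fusion Module), but the dimensions, the toy denoising loop, and the cross-attention used for alignment are placeholders chosen for this sketch, not the authors' implementation.

```python
# Minimal sketch of a text-to-motion pipeline in the spirit of the abstract:
# DDMG (diffusion-style motion generation) -> MDA (alignment with LLM signal
# embeddings) -> Fusion Module. All sizes and update rules are illustrative.
import torch
import torch.nn as nn


class DDMG(nn.Module):
    """Denoising Diffusion Motion Generation: text embedding -> motion sequence."""
    def __init__(self, text_dim=512, motion_dim=64, seq_len=60, steps=50):
        super().__init__()
        self.seq_len, self.motion_dim, self.steps = seq_len, motion_dim, steps
        self.denoiser = nn.Sequential(
            nn.Linear(motion_dim + text_dim, 256), nn.ReLU(), nn.Linear(256, motion_dim)
        )

    def forward(self, text_emb):                            # (B, text_dim)
        B = text_emb.size(0)
        x = torch.randn(B, self.seq_len, self.motion_dim)   # start from Gaussian noise
        cond = text_emb.unsqueeze(1).expand(B, self.seq_len, -1)
        for _ in range(self.steps):                         # toy iterative denoising
            x = x - 0.1 * self.denoiser(torch.cat([x, cond], dim=-1))
        return x                                            # (B, seq_len, motion_dim)


class MDA(nn.Module):
    """Motion Decoding Alignment: refine motion features with an LLM signal embedding."""
    def __init__(self, motion_dim=64, llm_dim=512):
        super().__init__()
        self.proj = nn.Linear(llm_dim, motion_dim)
        self.attn = nn.MultiheadAttention(motion_dim, num_heads=4, batch_first=True)

    def forward(self, motion, llm_emb):                     # (B, T, D), (B, llm_dim)
        signal = self.proj(llm_emb).unsqueeze(1)            # (B, 1, D)
        refined, _ = self.attn(motion, signal, signal)      # cross-attend to the LLM signal
        return motion + refined


class FusionModule(nn.Module):
    """Fusion Module: merge motion features from the previous phases."""
    def __init__(self, motion_dim=64):
        super().__init__()
        self.fuse = nn.Linear(2 * motion_dim, motion_dim)

    def forward(self, raw, refined):
        return self.fuse(torch.cat([raw, refined], dim=-1))


if __name__ == "__main__":
    text_emb = torch.randn(2, 512)      # stand-in for an MLLM text embedding
    llm_emb = torch.randn(2, 512)       # stand-in for the LLM signal embedding
    ddmg, mda, fm = DDMG(), MDA(), FusionModule()
    raw = ddmg(text_emb)
    refined = mda(raw, llm_emb)
    motion = fm(raw, refined)
    print(motion.shape)                 # torch.Size([2, 60, 64])
```

In this sketch the fusion step simply concatenates and projects the raw DDMG output and the MDA-refined features, which loosely mirrors the abstract's description of integrating motion features from the previous phases; the paper itself should be consulted for the actual architectures and training procedure.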

Text: BEAR2025-Camera_Ready - Accepted Manuscript (2MB)
Available under License Creative Commons Attribution.

More information

Published date: 25 August 2025
Venue - Dates: BEAR Workshop, The 2025 IEEE International Conference on Robot and Human Interactive Communication, Eindhoven, Netherlands, 2025-08-25 - 2025-08-29

Identifiers

Local EPrints ID: 506590
URI: http://eprints.soton.ac.uk/id/eprint/506590
PURE UUID: d0444c4b-3d33-4e3c-afbf-e296ae6f0828
ORCID for Tan Viet Tuyen Nguyen: orcid.org/0000-0001-8000-6485
ORCID for Gopal Ramchurn: orcid.org/0000-0001-9686-4302

Catalogue record

Date deposited: 11 Nov 2025 17:58
Last modified: 12 Nov 2025 03:06


Contributors

Author: Thien Doanh Le
Author: Tan Viet Tuyen Nguyen
Author: Tan Le Duy
Author: Gopal Ramchurn

