Beyond text: multi-modal LLM in human robot interaction
Ahmad Ansari, Aamir (1f023156-e7cf-46ec-a07c-b8642d4498e5)
Ramchurn, Gopal (1d62ae2a-a498-444e-912d-a6082d3aaea3)
Nguyen, Tan Viet Tuyen (f6e9374c-5174-4446-b4f0-5e6359efc105)
Ahmad Ansari, Aamir, Ramchurn, Gopal and Nguyen, Tan Viet Tuyen (2025) Beyond text: multi-modal LLM in human robot interaction. UK AI Research Symposium: A Festival of Ideas, Northumbria University, Newcastle upon Tyne, United Kingdom, 08 - 09 Sep 2025. 3 pp. (In Press)
Record type: Conference or Workshop Item (Paper)
Abstract
Multimodal interaction plays a vital role in Human-Robot Interaction (HRI), enabling robots to communicate with humans through multiple channels. This study introduces a novel approach to enhance such interactions by treating images and human motion as distinct foreign languages, in addition to text. In the proposed framework, vector quantization is employed to convert multimodal inputs such as images and human motions into an aligned set of tokens. A Large Language Model (LLM) is then pre-trained using Low-Rank Adaptation (LoRA) and instruction-tuned on a dialogue dataset that incorporates both image and motion context. The proposed multimodal LLM framework aims to equip robots with the ability to understand and respond to complex human queries through multimodal inputs and outputs, enabling more natural and effective interactions.
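The abstract describes a pipeline in which non-text modalities are vector-quantized into discrete tokens that a LoRA-adapted LLM can consume alongside text. The following Python sketch (not the authors' implementation; the encoder outputs, codebook size, and token naming scheme are illustrative assumptions) shows how codebook indices from an image or motion encoder could be rendered as pseudo-words and interleaved with a text prompt.

# Minimal sketch, assuming a pre-trained encoder supplies the continuous features;
# the codebook size, feature dimension, and "<img_i>"/"<mot_i>" token names are
# assumptions for illustration only.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Maps continuous feature vectors to indices of their nearest codebook entries."""
    def __init__(self, num_codes: int = 512, code_dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (N, code_dim) continuous vectors from an image or motion encoder
        distances = torch.cdist(features, self.codebook.weight)  # (N, num_codes)
        return distances.argmin(dim=-1)                          # (N,) discrete indices

def indices_to_tokens(indices: torch.Tensor, modality: str) -> list[str]:
    """Render codebook indices as pseudo-words that can be added to an LLM vocabulary."""
    return [f"<{modality}_{int(i)}>" for i in indices]

if __name__ == "__main__":
    torch.manual_seed(0)
    vq = VectorQuantizer()
    # Stand-ins for encoder outputs: 4 image-patch features and 3 motion-frame features.
    image_feats = torch.randn(4, 64)
    motion_feats = torch.randn(3, 64)
    image_tokens = indices_to_tokens(vq(image_feats), "img")
    motion_tokens = indices_to_tokens(vq(motion_feats), "mot")
    # The multimodal prompt interleaves text with the quantized "foreign language" tokens.
    prompt = "User: what gesture is the person making? " + " ".join(image_tokens + motion_tokens)
    print(prompt)

In the full framework as described, the LLM's vocabulary would be extended with such modality tokens and the model adapted with LoRA and instruction-tuned on image- and motion-grounded dialogue, so that it learns to read and emit these tokens alongside text.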
Text: MLLMinHRI-CameraReady - Accepted Manuscript (Restricted to Repository staff only)
More information
Accepted/In Press date: 8 September 2025
Venue - Dates:
UK AI Research Symposium: A Festival of Ideas, Northumbria University, Newcastle upon Tyne, United Kingdom, 2025-09-08 - 2025-09-09
Identifiers
Local EPrints ID: 505659
URI: http://eprints.soton.ac.uk/id/eprint/505659
PURE UUID: 529c2965-155e-4717-9285-d915076b5619
Catalogue record
Date deposited: 15 Oct 2025 16:56
Last modified: 16 Oct 2025 02:11
Contributors
Author: Aamir Ahmad Ansari
Author: Gopal Ramchurn
Author: Tan Viet Tuyen Nguyen