University of Southampton Institutional Repository

Beyond text: multi-modal LLM in human robot interaction


Ahmad Ansari, Aamir, Ramchurn, Gopal and Nguyen, Tan Viet Tuyen (2025) Beyond text: multi-modal LLM in human robot interaction. UK AI Research Symposium: A Festival of Ideas, Northumbria University, Newcastle upon Tyne, United Kingdom. 08 - 09 Sep 2025. 3 pp. (In Press)

Record type: Conference or Workshop Item (Paper)

Abstract

Multimodal interaction plays a vital role in Human-Robot Interaction (HRI), enabling robots to communicate with humans through multiple channels. This study introduces a novel approach to enhance such interactions by treating images and human motion as distinct foreign languages, in addition to text. In the proposed framework, vector quantization is employed to convert multimodal inputs such as images and human motions to an aligned set of tokens. A Large Language Model (LLM) is then pre-trained with the use of Low-Rank Adaptation (LoRA) and instruction-tuned on a dialogue dataset that incorporates both image and motion context. The proposed multimodal LLM framework aims to equip robots with the ability to understand and respond to complex human queries through multimodal inputs and outputs, enabling more natural and effective interactions.
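The quantization step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the codebook here is random rather than learned (in practice it would come from a trained model such as a VQ-VAE), and all names (`VQTokenizer`, `to_tokens`, the `<img_N>` token format) are hypothetical.

```python
import random

class VQTokenizer:
    """Toy vector quantizer: maps continuous feature vectors to discrete
    token ids by nearest-neighbour lookup in a fixed codebook, then
    renders them as pseudo-words an LLM can ingest alongside text."""

    def __init__(self, codebook_size=256, dim=4, prefix="img", seed=0):
        rng = random.Random(seed)
        self.prefix = prefix
        # Random codebook for illustration; a real system would learn it.
        self.codebook = [
            [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(codebook_size)
        ]

    def quantize(self, features):
        """Return, for each feature vector, the id of its nearest
        codebook entry (squared Euclidean distance)."""
        ids = []
        for f in features:
            dists = [
                sum((a - b) ** 2 for a, b in zip(f, c))
                for c in self.codebook
            ]
            ids.append(min(range(len(dists)), key=dists.__getitem__))
        return ids

    def to_tokens(self, features):
        """Render ids as pseudo-words, treating the modality as a
        'foreign language' with its own vocabulary, e.g. '<img_17>'."""
        return [f"<{self.prefix}_{i}>" for i in self.quantize(features)]

# Two 4-dimensional image (or motion) feature vectors become two
# discrete tokens that can be appended to the LLM's vocabulary.
tok = VQTokenizer(codebook_size=16, dim=4)
frames = [[0.1, -0.2, 0.3, 0.0], [0.9, 0.9, -0.5, 0.1]]
tokens = tok.to_tokens(frames)
```

Once each modality is mapped to such tokens, the LLM sees a single aligned token stream, and parameter-efficient adaptation (e.g. LoRA) can be applied to the base model without retraining it from scratch.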

Text
MLLMinHRI-CameraReady - Accepted Manuscript
Restricted to Repository staff only

More information

Accepted/In Press date: 8 September 2025
Venue - Dates: UK AI Research Symposium: A Festival of Ideas, Northumbria University, Newcastle upon Tyne, United Kingdom, 2025-09-08 - 2025-09-09

Identifiers

Local EPrints ID: 505659
URI: http://eprints.soton.ac.uk/id/eprint/505659
PURE UUID: 529c2965-155e-4717-9285-d915076b5619
ORCID for Gopal Ramchurn: orcid.org/0000-0001-9686-4302
ORCID for Tan Viet Tuyen Nguyen: orcid.org/0000-0001-8000-6485

Catalogue record

Date deposited: 15 Oct 2025 16:56
Last modified: 16 Oct 2025 02:11


Contributors

Author: Aamir Ahmad Ansari
Author: Gopal Ramchurn
Author: Tan Viet Tuyen Nguyen


