Beyond text: multi-modal LLM in human robot interaction
Ahmad Ansari, Aamir (1f023156-e7cf-46ec-a07c-b8642d4498e5)
Ramchurn, Gopal (1d62ae2a-a498-444e-912d-a6082d3aaea3)
Nguyen, Tan Viet Tuyen (f6e9374c-5174-4446-b4f0-5e6359efc105)
Ahmad Ansari, Aamir, Ramchurn, Gopal and Nguyen, Tan Viet Tuyen (2025) Beyond text: multi-modal LLM in human robot interaction. UK AI Research Symposium: A Festival of Ideas, Northumbria University, Newcastle upon Tyne, United Kingdom, 08 - 09 Sep 2025. 3 pp. (In Press)
Record type: Conference or Workshop Item (Paper)
Abstract
Multimodal interaction plays a vital role in Human-Robot Interaction (HRI), enabling robots to communicate with humans through multiple channels. This study introduces a novel approach to enhance such interactions by treating images and human motion as distinct foreign languages, in addition to text. In the proposed framework, vector quantization is employed to convert multimodal inputs such as images and human motions into an aligned set of tokens. A Large Language Model (LLM) is then pre-trained using Low-Rank Adaptation (LoRA) and instruction-tuned on a dialogue dataset that incorporates both image and motion context. The proposed multimodal LLM framework aims to equip robots with the ability to understand and respond to complex human queries through multimodal inputs and outputs, enabling more natural and effective interactions.
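The abstract describes a pipeline in which non-text modalities are vector-quantized into discrete tokens that a LoRA-adapted LLM can consume alongside text. The following Python sketch (not the authors' implementation; the encoder outputs, codebook size, and token naming scheme are illustrative assumptions) shows how codebook indices from an image or motion encoder could be rendered as pseudo-words and interleaved with a text prompt.

# Minimal sketch, assuming a pre-trained encoder supplies the continuous features;
# the codebook size, feature dimension, and "<img_i>"/"<mot_i>" token names are
# assumptions for illustration only.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Maps continuous feature vectors to indices of their nearest codebook entries."""
    def __init__(self, num_codes: int = 512, code_dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (N, code_dim) continuous vectors from an image or motion encoder
        distances = torch.cdist(features, self.codebook.weight)  # (N, num_codes)
        return distances.argmin(dim=-1)                          # (N,) discrete indices

def indices_to_tokens(indices: torch.Tensor, modality: str) -> list[str]:
    """Render codebook indices as pseudo-words that can be added to an LLM vocabulary."""
    return [f"<{modality}_{int(i)}>" for i in indices]

if __name__ == "__main__":
    torch.manual_seed(0)
    vq = VectorQuantizer()
    # Stand-ins for encoder outputs: 4 image-patch features and 3 motion-frame features.
    image_feats = torch.randn(4, 64)
    motion_feats = torch.randn(3, 64)
    image_tokens = indices_to_tokens(vq(image_feats), "img")
    motion_tokens = indices_to_tokens(vq(motion_feats), "mot")
    # The multimodal prompt interleaves text with the quantized "foreign language" tokens.
    prompt = "User: what gesture is the person making? " + " ".join(image_tokens + motion_tokens)
    print(prompt)

In the full framework as described, the LLM's vocabulary would be extended with such modality tokens and the model adapted with LoRA and instruction-tuned on image- and motion-grounded dialogue, so that it learns to read and emit these tokens alongside text.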
Text: MLLMinHRI-CameraReady - Accepted Manuscript (Restricted to Repository staff only)
More information
Accepted/In Press date: 8 September 2025
Venue - Dates:
UK AI Research Symposium: A Festival of Ideas, Northumbria University, Newcastle upon Tyne, United Kingdom, 2025-09-08 - 2025-09-09
Identifiers
Local EPrints ID: 505659
URI: http://eprints.soton.ac.uk/id/eprint/505659
PURE UUID: 529c2965-155e-4717-9285-d915076b5619
Catalogue record
Date deposited: 15 Oct 2025 16:56
Last modified: 16 Oct 2025 02:11
Contributors
Author: Aamir Ahmad Ansari
Author: Gopal Ramchurn
Author: Tan Viet Tuyen Nguyen