MOGO: residual quantized hierarchical causal transformer for high-quality and real-time 3D human motion generation
Abstract
Recent advances in transformer-based text-to-motion generation have led to impressive progress in synthesizing high-quality human motion. Nevertheless, jointly achieving high fidelity, streaming capability, real-time responsiveness, and scalability remains a fundamental challenge. In this paper, we propose MOGO (Motion Generation with One-pass), a novel autoregressive framework tailored for efficient and real-time 3D motion generation. MOGO comprises two key components: (1) MoSA-VQ, a motion scale-adaptive residual vector quantization module that hierarchically discretizes motion sequences with learnable scaling to produce compact yet expressive representations; and (2) RQHC-Transformer, a residual quantized hierarchical causal transformer that generates multi-layer motion tokens in a single forward pass, significantly reducing inference latency. To enhance semantic fidelity, we further introduce a text condition alignment mechanism that improves motion decoding under textual control. Extensive experiments on benchmark datasets including HumanML3D, KIT-ML, and CMP demonstrate that MOGO achieves competitive or superior generation quality compared to state-of-the-art transformer-based methods, while offering substantial improvements in real-time performance, streaming generation, and generalization under zero-shot settings.
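The full method is described only in the linked preprint, not in this record. As a rough illustration of the residual vector quantization idea the abstract refers to, the sketch below quantizes a motion feature sequence in coarse-to-fine layers, with each layer encoding the residual left by the previous ones. All names, shapes, codebook sizes, and the nearest-neighbour lookup are illustrative assumptions; this is not the authors' MoSA-VQ module, which additionally applies learnable scaling at each level.

```python
# Minimal residual vector quantization sketch (illustrative only; not the
# authors' MoSA-VQ, which also learns per-layer scaling factors).
import torch

def residual_quantize(x, codebooks):
    """Quantize features x of shape (T, D) with a list of (K, D) codebooks.

    Each layer quantizes what the previous layers failed to explain, so
    later layers capture progressively finer motion detail.
    """
    residual = x
    codes, quantized = [], torch.zeros_like(x)
    for codebook in codebooks:
        # Nearest codebook entry for the current residual (Euclidean distance).
        dists = torch.cdist(residual, codebook)   # (T, K)
        idx = dists.argmin(dim=-1)                # (T,) token ids for this layer
        layer_q = codebook[idx]                   # (T, D) quantized residual
        quantized = quantized + layer_q
        residual = residual - layer_q
        codes.append(idx)
    return codes, quantized  # per-layer token ids, reconstructed features

# Example: a 60-frame motion feature sequence, 3 quantization layers.
torch.manual_seed(0)
x = torch.randn(60, 128)
codebooks = [torch.randn(512, 128) for _ in range(3)]
codes, x_hat = residual_quantize(x, codebooks)
print([c.shape for c in codes], (x - x_hat).norm() / x.norm())
```

The multi-layer token ids produced this way are what a hierarchical causal transformer such as the paper's RQHC-Transformer would predict; generating all layers in a single forward pass, rather than one pass per layer, is the source of the latency reduction the abstract claims.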
Text: 2506.05952v1 (Author's Original)
More information
Published date: 6 June 2025
Additional information: 9 pages, 4 figures, conference
Keywords: cs.CV, cs.AI
Identifiers
Local EPrints ID: 502990
URI: http://eprints.soton.ac.uk/id/eprint/502990
PURE UUID: d7eadd98-02a1-4d33-b982-700afe6c6a14
Catalogue record
Date deposited: 15 Jul 2025 16:54
Last modified: 17 Jul 2025 02:27
Contributors
Authors: Dongjie Fu, Tengjiao Sun, Pengcheng Fang, Xiaohao Cai, Hansung Kim