MOGO: residual quantized hierarchical causal transformer for high-quality and real-time 3D human motion generation
Abstract
Recent advances in transformer-based text-to-motion generation have led to impressive progress in synthesizing high-quality human motion. Nevertheless, jointly achieving high fidelity, streaming capability, real-time responsiveness, and scalability remains a fundamental challenge. In this paper, we propose MOGO (Motion Generation with One-pass), a novel autoregressive framework tailored for efficient and real-time 3D motion generation. MOGO comprises two key components: (1) MoSA-VQ, a motion scale-adaptive residual vector quantization module that hierarchically discretizes motion sequences with learnable scaling to produce compact yet expressive representations; and (2) RQHC-Transformer, a residual quantized hierarchical causal transformer that generates multi-layer motion tokens in a single forward pass, significantly reducing inference latency. To enhance semantic fidelity, we further introduce a text condition alignment mechanism that improves motion decoding under textual control. Extensive experiments on benchmark datasets including HumanML3D, KIT-ML, and CMP demonstrate that MOGO achieves competitive or superior generation quality compared to state-of-the-art transformer-based methods, while offering substantial improvements in real-time performance, streaming generation, and generalization under zero-shot settings.
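The full method is described only in the linked preprint, not in this record. As a rough illustration of the residual vector quantization idea the abstract refers to, the sketch below quantizes a motion feature sequence in coarse-to-fine layers, with each layer encoding the residual left by the previous ones. All names, shapes, codebook sizes, and the nearest-neighbour lookup are illustrative assumptions; this is not the authors' MoSA-VQ module, which additionally applies learnable scaling at each level.

```python
# Minimal residual vector quantization sketch (illustrative only; not the
# authors' MoSA-VQ, which also learns per-layer scaling factors).
import torch

def residual_quantize(x, codebooks):
    """Quantize features x of shape (T, D) with a list of (K, D) codebooks.

    Each layer quantizes what the previous layers failed to explain, so
    later layers capture progressively finer motion detail.
    """
    residual = x
    codes, quantized = [], torch.zeros_like(x)
    for codebook in codebooks:
        # Nearest codebook entry for the current residual (Euclidean distance).
        dists = torch.cdist(residual, codebook)   # (T, K)
        idx = dists.argmin(dim=-1)                # (T,) token ids for this layer
        layer_q = codebook[idx]                   # (T, D) quantized residual
        quantized = quantized + layer_q
        residual = residual - layer_q
        codes.append(idx)
    return codes, quantized  # per-layer token ids, reconstructed features

# Example: a 60-frame motion feature sequence, 3 quantization layers.
torch.manual_seed(0)
x = torch.randn(60, 128)
codebooks = [torch.randn(512, 128) for _ in range(3)]
codes, x_hat = residual_quantize(x, codebooks)
print([c.shape for c in codes], (x - x_hat).norm() / x.norm())
```

The multi-layer token ids produced this way are what a hierarchical causal transformer such as the paper's RQHC-Transformer would predict; generating all layers in a single forward pass, rather than one pass per layer, is the source of the latency reduction the abstract claims.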
Text: 2506.05952v1 (Author's Original)
More information
Published date: 6 June 2025
Additional information: 9 pages, 4 figures, conference
Keywords: cs.CV, cs.AI
Identifiers
Local EPrints ID: 502990
URI: http://eprints.soton.ac.uk/id/eprint/502990
PURE UUID: d7eadd98-02a1-4d33-b982-700afe6c6a14
Catalogue record
Date deposited: 15 Jul 2025 16:54
Last modified: 17 Jul 2025 02:27
Contributors
Authors: Dongjie Fu, Tengjiao Sun, Pengcheng Fang, Xiaohao Cai, Hansung Kim