MotionGPT3: Human Motion as a Second Modality

About

With the rapid progress of large language models (LLMs), multimodal frameworks that unify understanding and generation have become promising, yet they face increasing complexity as the number of modalities and tasks grows. We observe that motion quantization introduces approximation errors that cap motion quality, and that unifying discrete text and continuous motion within a single-stream backbone amplifies cross-modal interference. Motivated by recent multi-branch Transformer designs that separate signals from different modalities, we propose MotionGPT3, a bimodal motion-language model for both understanding and generation. MotionGPT3 encodes raw motion into a continuous latent space using a variational autoencoder (VAE), thereby avoiding quantization-induced artifacts, while leveraging the semantic prior of pretrained language models. A dual-stream Transformer with shared attention preserves modality-specific routes while enabling controlled, bidirectional information flow, which reduces interference, stabilizes optimization, and empirically accelerates convergence without degrading fidelity. For multimodal joint training, a generate-then-align three-stage schedule further improves stability and limits cross-task interference. Experiments show that MotionGPT3 achieves 2x faster convergence in training loss and up to 4x faster convergence in validation, while maintaining state-of-the-art performance on standard motion understanding and motion generation benchmarks.

Bingfan Zhu, Biao Jiang, Sunyi Wang, Shixiang Tang, Tao Chen, Linjie Luo, Youyi Zheng, Xin Chen • 2025
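
The abstract's "dual-stream Transformer with shared attention" can be illustrated with a small sketch. This is not the authors' implementation: it assumes each stream (text, motion) keeps its own query/key/value projections while a single softmax attention runs over the concatenated token sequence, with no masking. The function names, shapes, and random weights below are all hypothetical and chosen only for illustration.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights (hypothetical)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def shared_attention(text, motion, w_text, w_motion):
    """One joint attention step over two streams with separate projections.

    text, motion: token matrices of shape (n_text, d) and (n_motion, d).
    w_text, w_motion: dicts with per-stream "q", "k", "v" matrices of shape (d, d).
    Returns the updated (text, motion) token matrices.
    Illustrative only; the real model may mask or gate this exchange.
    """
    d = len(text[0])
    # Modality-specific routes: each stream uses its own projection weights.
    # List "+" concatenates the projected rows of both streams.
    q = matmul(text, w_text["q"]) + matmul(motion, w_motion["q"])
    k = matmul(text, w_text["k"]) + matmul(motion, w_motion["k"])
    v = matmul(text, w_text["v"]) + matmul(motion, w_motion["v"])
    # Shared attention: every token attends over the concatenated sequence.
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    out = matmul(weights, v)
    # Split the joint output back into the two streams.
    return out[:len(text)], out[len(text):]

rng = random.Random(0)
d, n_text, n_motion = 4, 3, 2
text = rand_matrix(n_text, d, rng)
motion = rand_matrix(n_motion, d, rng)
w_text = {name: rand_matrix(d, d, rng) for name in "qkv"}
w_motion = {name: rand_matrix(d, d, rng) for name in "qkv"}
text_out, motion_out = shared_attention(text, motion, w_text, w_motion)
print(len(text_out), len(motion_out))
```

In this sketch the only place the two modalities interact is the joint softmax, so information flows in both directions while the projection weights stay separate per stream.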

Related benchmarks

Task	Dataset	Result
Text-to-motion generation	HumanML3D (test)	FID 0.021	553
text-to-motion mapping	HumanML3D (test)	FID 0.208	283
Human Motion Prediction	Human3.6M (test)	MPJPE 42.3	85
Text-driven Motion Generation	HumanML3D (test)	R-Precision@1 54.3	54
Motion-to-Text	HumanML3D (test)	BLEU@4 19.6	48
Text-to-motion	KIT-ML	R@3 80.3	44
Text-to-Motion Synthesis	HumanML3D	R-Precision (Top 1) 55.3	43
Motion Generation	MBench 16 (official leaderboard)	Jitter Penalty 0.012	17
Text-to-motion	MotionHub (test)	R-Precision (T1) 20	12
Text-to-motion	HumanML3D 10 (test)	R-Precision@1 41	12

Showing 10 of 25 rows
