Temporal Consistency-Aware Text-to-Motion Generation
About
Text-to-Motion (T2M) generation aims to synthesize realistic human motion sequences from natural language descriptions. While two-stage frameworks leveraging discrete motion representations have advanced T2M research, they often neglect cross-sequence temporal consistency, i.e., the shared temporal structure present across different instances of the same action. Neglecting this structure yields semantically misaligned and physically implausible motions. To address this limitation, we propose TCA-T2M, a framework for temporal consistency-aware T2M generation. Our approach introduces a temporal consistency-aware spatial VQ-VAE (TCaS-VQ-VAE) for cross-sequence temporal alignment, coupled with a masked motion transformer for text-conditioned motion generation. Additionally, a kinematic constraint block mitigates discretization artifacts to ensure physical plausibility. Experiments on the HumanML3D and KIT-ML benchmarks demonstrate that TCA-T2M achieves state-of-the-art performance, highlighting the importance of temporal consistency for robust and coherent T2M generation.
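The description above does not include code, but the two-stage design hinges on the stage-1 quantization bottleneck: an encoder maps a motion clip to frame-level features, each feature is snapped to its nearest entry in a learned codebook, and the resulting discrete tokens are what the stage-2 masked transformer predicts from text. As a rough illustration only (this is not TCA-T2M's actual implementation; the class and hyperparameter names `VectorQuantizer`, `num_codes`, and `beta` are our own, and the cross-sequence temporal-alignment and kinematic-constraint components are omitted), a generic PyTorch sketch of such a bottleneck might look like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Generic VQ-VAE bottleneck: discretize motion features against a codebook."""

    def __init__(self, num_codes: int = 512, code_dim: int = 256, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight (hypothetical default)

    def forward(self, z: torch.Tensor):
        # z: (batch, frames, code_dim) encoder output for a motion clip.
        flat = z.reshape(-1, z.size(-1))                 # (batch*frames, code_dim)
        dist = torch.cdist(flat, self.codebook.weight)   # distances to all codes
        idx = dist.argmin(dim=-1).view(z.shape[:-1])     # (batch, frames) tokens
        z_q = self.codebook(idx)                         # quantized features
        # Codebook + commitment losses; straight-through estimator for gradients.
        vq_loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()
        return z_q, idx, vq_loss

# Usage: the discrete token sequence is the interface between the two stages.
vq = VectorQuantizer()
z = torch.randn(2, 64, 256)        # 2 clips, 64 downsampled frames of features
z_q, tokens, vq_loss = vq(z)       # tokens feed the stage-2 masked transformer
```

The straight-through estimator here is the standard trick for training through the non-differentiable nearest-neighbor lookup; any cross-sequence temporal consistency objective of the kind TCA-T2M proposes would be an additional loss on top of this basic quantization step.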
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-motion generation | KIT-ML (test) | FID | 0.198 | 115 |
| Text-to-motion generation | HumanML3D 1 (test) | R-Precision (Top 1) | 0.517 | 32 |
| Text-to-motion generation | KIT-ML 1.0 (test) | R-Precision (Top 1) | 42.8 | 14 |