T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations
About
In this work, we investigate a simple and well-known conditional generative framework based on the Vector Quantised-Variational AutoEncoder (VQ-VAE) and the Generative Pre-trained Transformer (GPT) for human motion generation from textual descriptions. We show that a simple CNN-based VQ-VAE with commonly used training recipes (EMA and Code Reset) yields high-quality discrete representations. For the GPT, we incorporate a simple corruption strategy during training to alleviate the training-testing discrepancy. Despite its simplicity, our T2M-GPT outperforms competitive approaches, including recent diffusion-based methods. For example, on HumanML3D, currently the largest dataset, we achieve comparable performance on the consistency between text and generated motion (R-Precision), while our FID of 0.116 largely outperforms MotionDiffuse at 0.630. We additionally conduct analyses on HumanML3D and observe that dataset size is a limitation of our approach. Our work suggests that VQ-VAE remains a competitive approach for human motion generation.
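The VQ-VAE side relies on two standard codebook recipes: exponential moving average (EMA) updates and Code Reset for rarely used codes. Below is a minimal PyTorch sketch of a quantizer combining both; the class, buffer, and threshold names (`EMAQuantizer`, `reset_threshold`, etc.) are illustrative assumptions, not taken from the T2M-GPT code base.

```python
import torch


class EMAQuantizer(torch.nn.Module):
    """Minimal VQ layer with EMA codebook updates and dead-code reset.

    Illustrative sketch of the two standard recipes (EMA and Code Reset);
    names and thresholds are assumptions, not the paper's implementation.
    """

    def __init__(self, num_codes: int = 512, dim: int = 512,
                 decay: float = 0.99, reset_threshold: float = 1.0):
        super().__init__()
        self.decay = decay
        self.reset_threshold = reset_threshold
        self.register_buffer("codebook", torch.randn(num_codes, dim))
        self.register_buffer("cluster_size", torch.zeros(num_codes))
        self.register_buffer("ema_embed", self.codebook.clone())

    @torch.no_grad()
    def _update_codebook(self, z: torch.Tensor, one_hot: torch.Tensor):
        # EMA update of per-code usage counts and summed assignments.
        self.cluster_size.mul_(self.decay).add_(one_hot.sum(0), alpha=1 - self.decay)
        self.ema_embed.mul_(self.decay).add_(one_hot.t() @ z, alpha=1 - self.decay)
        norm = self.cluster_size.clamp(min=1e-5).unsqueeze(1)
        self.codebook.copy_(self.ema_embed / norm)
        # Code Reset: re-initialize rarely used codes from encoder outputs.
        dead = self.cluster_size < self.reset_threshold
        if dead.any():
            replacements = z[torch.randint(0, z.size(0), (int(dead.sum()),))]
            self.codebook[dead] = replacements
            self.ema_embed[dead] = replacements
            self.cluster_size[dead] = self.reset_threshold

    def forward(self, z: torch.Tensor):
        # z: (batch, dim) encoder outputs; assign each to the nearest code.
        dist = torch.cdist(z, self.codebook)   # (batch, num_codes)
        idx = dist.argmin(dim=1)               # (batch,)
        one_hot = torch.nn.functional.one_hot(idx, self.codebook.size(0)).float()
        if self.training:
            self._update_codebook(z.detach(), one_hot)
        # Straight-through estimator so gradients flow to the encoder.
        z_q = self.codebook[idx]
        z_q = z + (z_q - z).detach()
        return z_q, idx
```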
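Similarly, the GPT-side corruption strategy can be illustrated by replacing some ground-truth motion token indices with random codebook entries during training, so the model learns to continue from imperfect contexts like those it produces at inference time. A minimal sketch, assuming a per-position corruption probability (`p_corrupt` is a hypothetical name; the paper instead controls the corrupted fraction of the sequence with a hyper-parameter):

```python
import torch


def corrupt_tokens(tokens: torch.Tensor, codebook_size: int,
                   p_corrupt: float = 0.3) -> torch.Tensor:
    """Replace each ground-truth VQ index with a random codebook index
    with probability `p_corrupt` (illustrative approximation)."""
    # Bernoulli mask selecting which positions to corrupt.
    mask = torch.rand(tokens.shape, device=tokens.device) < p_corrupt
    # Replacement indices drawn uniformly from the codebook.
    random_ids = torch.randint_like(tokens, high=codebook_size)
    return torch.where(mask, random_ids, tokens)


# Dummy batch of VQ-VAE indices: 4 sequences of 50 tokens, codebook of 512.
tokens = torch.randint(0, 512, (4, 50))
noisy = corrupt_tokens(tokens, codebook_size=512)
```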
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-motion generation | HumanML3D (test) | FID | 0.116 | 331 |
| Text-to-motion mapping | KIT-ML (test) | R-Precision (Top 3) | 0.745 | 275 |
| Text-to-motion mapping | HumanML3D (test) | FID | 0.07 | 243 |
| Sign Language Translation | PHOENIX-2014T (test) | BLEU-4 | 11.66 | 159 |
| Text-to-motion generation | KIT-ML (test) | FID | 0.512 | 115 |
| Sign Language Translation | How2Sign (test) | BLEU-4 | 3.53 | 61 |
| Text-to-Motion Synthesis | HumanML3D | R-Precision (Top 1) | 67.6 | 43 |
| 3D Human Motion Generation | HumanAct12 | FID | 0.064 | 36 |
| Text-driven Motion Generation | HumanML3D (test) | R-Precision@1 | 49.7 | 36 |
| Text-to-motion | KIT-ML | R@3 | 74.5 | 33 |