MG-MotionLLM: A Unified Framework for Motion Comprehension and Generation across Multiple Granularities
About
Recent motion-aware large language models have demonstrated promising potential in unifying motion comprehension and generation. However, existing approaches primarily focus on coarse-grained motion-text modeling, where text describes the overall semantics of an entire motion sequence in just a few words. This limits their ability to handle fine-grained motion-relevant tasks, such as understanding and controlling the movements of specific body parts. To overcome this limitation, we introduce MG-MotionLLM, a unified motion-language model for multi-granular motion comprehension and generation. We further propose a comprehensive multi-granularity training scheme that incorporates a set of novel auxiliary tasks, such as localizing the temporal boundaries of motion segments from detailed text and generating detailed motion captions, so that motion-text modeling at different levels of granularity mutually reinforces. Extensive experiments show that MG-MotionLLM achieves superior performance on the classical text-to-motion and motion-to-text tasks, and shows promise on novel fine-grained motion comprehension and editing tasks. Project page: CVI-SZU/MG-MotionLLM
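As a rough illustration of how a unified model can cast tasks of different granularities as instruction following, the sketch below frames generation, fine-grained editing, and temporal localization as text prompts over discrete motion tokens. This is a minimal sketch, not the authors' implementation: the MotionGPT-style tokenization, sentinel markers, and prompt templates are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code): a unified motion-language model can
# treat every task as sequence-to-sequence over a shared vocabulary, where
# motion is quantized into discrete codes (e.g. by a VQ-VAE) and rendered as
# text tokens. Token names and templates below are illustrative assumptions.

MOTION_BOS, MOTION_EOS = "<motion>", "</motion>"

def motion_to_tokens(code_ids):
    """Render quantized motion codes as text tokens the LLM can consume."""
    body = " ".join(f"<m_{i}>" for i in code_ids)
    return f"{MOTION_BOS} {body} {MOTION_EOS}"

# Coarse-grained generation: short caption -> full motion sequence.
coarse_prompt = "Generate a motion matching: 'a person waves with the right hand.'"

# Fine-grained editing: body-part-level text conditions the edit.
fine_prompt = (
    "Edit the following motion so that the left arm stays lowered while "
    f"the right hand waves: {motion_to_tokens([17, 4, 92, 4])}"
)

# Auxiliary task: localize the segment described by detailed text.
localize_prompt = (
    f"In {motion_to_tokens([17, 4, 92, 4, 63])}, output the start and end "
    "frames where 'the right hand rises above the shoulder.'"
)

for p in (coarse_prompt, fine_prompt, localize_prompt):
    print(p, end="\n\n")
```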
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| text-to-motion mapping | HumanML3D (test) | FID | 0.303 | 243 |
| Text-to-Motion Synthesis | HumanML3D | R-Precision (Top 1) | 51.6 | 43 |
| Text-driven Motion Generation | HumanML3D (test) | R-Precision@1 | 51.6 | 36 |
| Text-to-motion generation | HumanML3D (test) | R-Precision (Top 1) | 0.516 | 32 |
| Motion-to-Text | HumanML3D (test) | BLEU@4 | 8.06 | 32 |
| Speed-based motion generation | AnyContext (test) | R@1 | 28.1 | 10 |
| Trajectory-based motion generation | AnyContext (test) | R@1 | 0.193 | 10 |
| Style-based motion generation | AnyContext (test) | R@1 | 0.18 | 10 |
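For context on the R-Precision rows above: on HumanML3D this metric is conventionally computed by ranking each motion's paired text description against 31 mismatched descriptions in a joint embedding space and counting how often the paired text ranks first. The sketch below assumes precomputed motion and text embeddings from a pretrained evaluator; the evaluator itself is out of scope here.

```python
import numpy as np

def r_precision_top1(motion_emb, text_emb, pool_size=32, seed=0):
    """R-Precision (Top 1) as commonly computed on HumanML3D: each motion is
    compared against its paired text plus (pool_size - 1) mismatched texts,
    ranked by Euclidean distance; the score is how often the paired text
    ranks first. motion_emb, text_emb: (N, D) arrays from a pretrained
    motion/text encoder (assumed given)."""
    rng = np.random.default_rng(seed)
    n = len(motion_emb)
    hits = 0
    for i in range(n):
        # Build a pool: the matched text plus random distractor texts.
        distractors = rng.choice(np.delete(np.arange(n), i),
                                 size=pool_size - 1, replace=False)
        pool = np.concatenate(([i], distractors))
        dists = np.linalg.norm(text_emb[pool] - motion_emb[i], axis=1)
        hits += int(np.argmin(dists) == 0)  # matched text ranked first?
    return hits / n

# Toy usage with synthetic embeddings (real use needs trained encoders).
m = np.random.randn(256, 512)
t = m + 0.1 * np.random.randn(256, 512)  # correlated pairs for illustration
print(f"R-Precision@1 = {r_precision_top1(m, t):.3f}")
```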