MaskControl: Spatio-Temporal Control for Masked Motion Synthesis
About
Recent advances in motion diffusion models have enabled spatially controllable text-to-motion generation. However, these models struggle to achieve high-precision control while maintaining high-quality motion generation. To address these challenges, we propose MaskControl, the first approach to introduce controllability to the generative masked motion model. Our approach introduces two key innovations. First, the Logits Regularizer implicitly perturbs logits at training time to align the distribution of motion tokens with the controlled joint positions, while regularizing the categorical token prediction to ensure high-fidelity generation. Second, Logit Optimization explicitly optimizes the predicted logits at inference time, directly reshaping the token distribution so that the generated motion accurately aligns with the controlled joint positions. Moreover, we introduce Differentiable Expectation Sampling (DES) to overcome the non-differentiable distribution sampling process encountered by both the Logits Regularizer and Logit Optimization. Extensive experiments demonstrate that MaskControl outperforms state-of-the-art methods, achieving superior motion quality (FID decreases by ~77%) and higher control precision (average error 0.91 vs. 1.08). Additionally, MaskControl enables diverse applications, including any-joint-any-frame control, body-part timeline control, and zero-shot objective control. Video visualization can be found at https://www.ekkasit.com/ControlMM-page/
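The abstract does not spell out how DES or Logit Optimization are formulated, so the following is only a minimal sketch of one plausible reading: replace the non-differentiable token sampling step with the expected codebook embedding under the softmax distribution, then run gradient descent on the logits to reduce a control error. Everything named here (`codebook`, `target`, `optimize_logits`, the learning rate, the step count) is a hypothetical stand-in, the "decoder" is reduced to the identity, and the gradient is written by hand to keep the example dependency-free. It is not the authors' implementation.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_embedding(logits, codebook):
    # Differentiable surrogate for sampling: the probability-weighted
    # average of codebook vectors instead of a hard token draw (assumption).
    probs = softmax(logits)
    dim = len(codebook[0])
    return [sum(p * c[d] for p, c in zip(probs, codebook)) for d in range(dim)]

def control_loss(logits, codebook, target):
    # Squared distance between the expected embedding and the control target.
    e = expected_embedding(logits, codebook)
    return sum((a - b) ** 2 for a, b in zip(e, target))

def optimize_logits(logits, codebook, target, lr=0.5, steps=200):
    # Plain gradient descent on the logits (hypothetical hyperparameters).
    z = list(logits)
    for _ in range(steps):
        probs = softmax(z)
        e = expected_embedding(z, codebook)
        g = [2.0 * (a - b) for a, b in zip(e, target)]  # dL/de
        g_dot_e = sum(gi * ei for gi, ei in zip(g, e))
        # Chain rule through softmax: dL/dz_j = p_j * (g . c_j - g . e)
        grad = [
            p * (sum(gi * ci for gi, ci in zip(g, c)) - g_dot_e)
            for p, c in zip(probs, codebook)
        ]
        z = [zj - lr * gj for zj, gj in zip(z, grad)]
    return z

# Toy usage: three 2-D "tokens"; the control target coincides with token 1.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
target = [1.0, 0.0]
initial = [0.0, 0.0, 0.0]
optimized = optimize_logits(initial, codebook, target)
print(control_loss(initial, codebook, target), control_loss(optimized, codebook, target))
```

In this toy setting the optimized logits shift probability mass toward the token whose embedding matches the target, which is the qualitative behavior the abstract attributes to reshaping the token distribution at inference time.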
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Motion Control | HumanML3D (test) | Average Error | 0.0072 | 65 |
| Task HSI-2: avoiding barrier | HumanML3D (test) | Foot Skate | 0.146 | 9 |
| Geometric-Constrained Motion Generation | Geometric-Constrained Generation | Trajectory Error | 3.064 | 8 |
| Joint-controlled motion generation | HumanML3D Pelvis | R-Precision (Top-3) | 80.3 | 7 |
| Joint-controlled motion generation | HumanML3D Average | R-Precision (Top-3) | 80.9 | 7 |
| Avoiding overhead barrier human-scene interaction | ProgMoGen protocol Unseen tasks (Task 2) | Skating Ratio | 0.163 | 3 |
| Head height constraint human-scene interaction | ProgMoGen protocol Unseen tasks (Task 1) | R-Precision (Top-3) | 69.4 | 3 |
| Walking inside a square human-scene interaction | ProgMoGen protocol Unseen tasks (Task 3) | Skating Ratio | 9.9 | 3 |