MaskControl: Spatio-Temporal Control for Masked Motion Synthesis

About

Recent advances in motion diffusion models have enabled spatially controllable text-to-motion generation. However, these models struggle to achieve high-precision control while maintaining high-quality motion generation. To address these challenges, we propose MaskControl, the first approach to introduce controllability to the generative masked motion model. Our approach introduces two key innovations. First, the Logits Regularizer implicitly perturbs logits at training time to align the distribution of motion tokens with the controlled joint positions, while regularizing the categorical token prediction to ensure high-fidelity generation. Second, Logit Optimization explicitly optimizes the predicted logits at inference time, directly reshaping the token distribution so that the generated motion accurately aligns with the controlled joint positions. Moreover, we introduce Differentiable Expectation Sampling (DES) to handle the non-differentiable distribution sampling process that both the Logits Regularizer and Logit Optimization encounter. Extensive experiments demonstrate that MaskControl outperforms state-of-the-art methods, achieving superior motion quality (FID decreases by ~77%) and higher control precision (average error 0.91 vs. 1.08). Additionally, MaskControl enables diverse applications, including any-joint-any-frame control, body-part timeline control, and zero-shot objective control. Video visualizations can be found at https://www.ekkasit.com/ControlMM-page/
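
The abstract does not spell out how Logit Optimization and DES are computed, so the following is only a toy sketch of one plausible reading, not the authors' implementation. It assumes that a hard token sample is replaced by the softmax-weighted expectation over codebook embeddings, which keeps the path from logits to a control loss differentiable, and that the logits are then refined by plain gradient descent. The codebook, the identity "decoder", the squared-error control loss, and every function name are hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def expected_embedding(logits, codebook):
    # Assumed stand-in for DES: instead of sampling one token per frame,
    # take the probability-weighted average of the codebook embeddings.
    return softmax(logits) @ codebook

def control_loss_and_grad(logits, codebook, target, mask):
    # Squared error between the expected embedding and the control target,
    # counted only on frames where mask == 1 (hypothetical control loss).
    p = softmax(logits)
    diff = (p @ codebook - target) * mask[:, None]
    loss = float((diff ** 2).sum())
    grad_p = (2.0 * diff) @ codebook.T
    # Backpropagate through the softmax to obtain the gradient w.r.t. logits.
    grad_z = p * (grad_p - (p * grad_p).sum(axis=-1, keepdims=True))
    return loss, grad_z

def optimize_logits(logits, codebook, target, mask, steps=500, lr=0.1):
    # Inference-time refinement: gradient descent directly on the logits.
    z = logits.copy()
    history = []
    for _ in range(steps):
        loss, grad = control_loss_and_grad(z, codebook, target, mask)
        history.append(loss)
        z -= lr * grad
    return z, history

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 3))     # 16 toy tokens, 3-D embeddings
logits = rng.normal(size=(8, 16))       # 8 frames of predicted logits
target = np.tile(codebook[5], (8, 1))   # toy control signal
mask = np.zeros(8)
mask[[2, 6]] = 1.0                      # control only frames 2 and 6
new_logits, history = optimize_logits(logits, codebook, target, mask)
```

In this toy setup only the logits of the controlled frames change, because the loss gradient is zero wherever the mask is zero. A real masked motion model would decode tokens to joint positions through a learned decoder rather than compare embeddings directly.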

Ekkasit Pinyoanuntapong, Muhammad Usama Saleem, Korrawe Karunratanakul, Pu Wang, Hongfei Xue, Chen Chen, Chuan Guo, Junli Cao, Jian Ren, Sergey Tulyakov • 2024

Related benchmarks

Task	Dataset	Result
Motion Control	HumanML3D (test)	Average Error: 0.0072	82
Task HSI-2: avoiding barrier	HumanML3D (test)	Foot Skate: 0.146	9
Geometric-Constrained Motion Generation	Geometric-Constrained Generation	Trajectory Error: 3.064	8
Joint-controlled motion generation	HumanML3D Pelvis	R-Precision (Top-3): 80.3	7
Joint-controlled motion generation	HumanML3D Average	R-Precision (Top-3): 80.9	7
Avoiding overhead barrier human-scene interaction	ProgMoGen protocol Unseen tasks (Task 2)	Skating Ratio: 0.163	3
Head height constraint human-scene interaction	ProgMoGen protocol Unseen tasks (Task 1)	R-Precision (Top-3): 69.4	3
Walking inside a square human-scene interaction	ProgMoGen protocol Unseen tasks (Task 3)	Skating Ratio: 9.9	3

