ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model

About

3D human motion generation is crucial for the creative industry. Recent advances rely on generative models with domain knowledge for text-driven motion generation, leading to substantial progress in capturing common motions. However, the performance on more diverse motions remains unsatisfactory. In this work, we propose ReMoDiffuse, a diffusion-model-based motion generation framework that integrates a retrieval mechanism to refine the denoising process. ReMoDiffuse enhances the generalizability and diversity of text-driven motion generation with three key designs: 1) Hybrid Retrieval finds appropriate references from the database in terms of both semantic and kinematic similarities. 2) Semantic-Modulated Transformer selectively absorbs retrieval knowledge, adapting to the difference between retrieved samples and the target motion sequence. 3) Condition Mixture better utilizes the retrieval database during inference, overcoming the scale sensitivity in classifier-free guidance. Extensive experiments demonstrate that ReMoDiffuse outperforms state-of-the-art methods by balancing both text-motion consistency and motion quality, especially for more diverse motion generation.
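As a rough illustration of the Hybrid Retrieval idea (ranking database entries by both semantic and kinematic similarity), here is a minimal sketch. It is not the authors' implementation: the scoring formula (cosine similarity of text features multiplied by an exponential penalty on the relative length difference), the `lam` weight, the function names, and the toy database are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(query_feat, query_len, entry_feat, entry_len, lam=0.1):
    """Assumed scoring rule: semantic term times a kinematic (length) term."""
    semantic = cosine(query_feat, entry_feat)
    # Relative gap between the requested and the stored motion length.
    rel_gap = abs(query_len - entry_len) / max(query_len, entry_len)
    kinematic = math.exp(-lam * rel_gap)
    return semantic * kinematic

def retrieve(query_feat, query_len, database, k=2, lam=0.1):
    """Return the k database entries with the highest hybrid score."""
    ranked = sorted(
        database,
        key=lambda e: hybrid_score(query_feat, query_len, e["feat"], e["length"], lam),
        reverse=True,
    )
    return ranked[:k]

# Toy database: hypothetical text features and motion lengths (in frames).
database = [
    {"name": "walk_a", "feat": [1.0, 0.0], "length": 60},
    {"name": "walk_b", "feat": [1.0, 0.0], "length": 200},
    {"name": "jump", "feat": [0.0, 1.0], "length": 60},
]

top = retrieve([1.0, 0.0], 60, database, k=2, lam=1.0)
names = [e["name"] for e in top]
print(names)
```

In this toy setup, two entries match the query text equally well, so the length term decides their order. An entry with unrelated text features scores zero regardless of its length.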

Mingyuan Zhang, Xinying Guo, Liang Pan, Zhongang Cai, Fangzhou Hong, Huirong Li, Lei Yang, Ziwei Liu • 2023

Related benchmarks

Task | Dataset | Result | Rank
Text-to-motion generation | HumanML3D (test) | FID 0.103 | 331
text-to-motion mapping | KIT-ML (test) | R Precision (Top 3) 0.765 | 275
text-to-motion mapping | HumanML3D (test) | FID 0.103 | 243
Text-to-motion generation | KIT-ML (test) | FID 0.155 | 115
Text-to-motion generation | HumanML3D 19 (test) | FID 0.103 | 37
Interactive Motion Synthesis | InterHuman (test) | R Precision (Top 1) 44.2 | 25
Text-to-motion generation | HumanML3D full dimension (test) | R-Precision Top 1 46.8 | 20
Multi-concept motion generation (single text) | MTT (test) | R@1 7.4 | 16
Text-to-motion generation | KIT-ML 52 (test) | R-Precision Top-1 0.427 | 11
Text-to-motion generation | KIT-ML full dimension (test) | R-Precision@1 35.6 | 9
Showing 10 of 15 rows
