
Fg-T2M: Fine-Grained Text-Driven Human Motion Generation via Diffusion Model

About

Text-driven human motion generation in computer vision is both significant and challenging. However, current methods are limited to producing either deterministic or imprecise motion sequences, failing to effectively control the temporal and spatial relationships required to conform to a given text description. In this work, we propose a fine-grained method for generating high-quality, conditional human motion sequences supporting precise text descriptions. Our approach consists of two key components: 1) a linguistics-structure assisted module that constructs accurate and complete language features to fully utilize the text information; and 2) a context-aware progressive reasoning module that learns neighborhood and overall semantic linguistic features from shallow and deep graph neural networks to achieve multi-step inference. Experiments show that our approach outperforms existing text-driven motion generation methods on the HumanML3D and KIT test sets and generates motion that visually conforms better to the text conditions.
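The abstract describes a diffusion-based generator: a motion sequence is sampled by iteratively denoising Gaussian noise under a text-derived conditioning feature. As a minimal illustration of that sampling loop (not the paper's actual network — the denoiser below is a toy stand-in, and all names are hypothetical), a DDPM-style reverse chain conditioned on a text feature vector can be sketched as:

```python
import numpy as np

def reverse_step(x_t, t, c, eps_model, betas, rng):
    """One DDPM reverse step x_t -> x_{t-1}, conditioned on text feature c."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    eps_hat = eps_model(x_t, t, c)                      # predicted noise
    coef = betas[t] / np.sqrt(1.0 - alpha_bar[t])
    mean = (x_t - coef * eps_hat) / np.sqrt(alphas[t])  # posterior mean
    if t == 0:
        return mean                                     # last step: no noise added
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

def sample_motion(seq_len, dim, c, eps_model, betas, seed=0):
    """Run the full reverse chain from Gaussian noise to a motion sequence."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((seq_len, dim))             # pure noise at t = T
    for t in reversed(range(len(betas))):
        x = reverse_step(x, t, c, eps_model, betas, rng)
    return x

# Toy denoiser (hypothetical): ignores t and nudges frames toward the
# conditioning vector c; the real model is a learned network over
# linguistics-structure features.
toy_eps = lambda x, t, c: x - c
betas = np.linspace(1e-4, 0.02, 50)                     # linear noise schedule
motion = sample_motion(seq_len=8, dim=4, c=np.ones(4), eps_model=toy_eps, betas=betas)
print(motion.shape)  # (8, 4): 8 frames, 4 pose features per frame
```

In the actual method, `eps_model` would be the conditioned denoising network, and `c` would come from the linguistics-structure assisted module rather than a fixed vector.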

Yin Wang, Zhiying Leng, Frederick W. B. Li, Shun-Cheng Wu, Xiaohui Liang• 2023

Related benchmarks

Task                              | Dataset          | Metric              | Result | Rank
Text-to-motion generation         | HumanML3D (test) | FID                 | 0.243  | 331
Text-to-motion mapping            | KIT-ML (test)    | R-Precision (Top 3) | 0.745  | 275
Text-to-motion mapping            | HumanML3D (test) | FID                 | 0.243  | 243
Text-to-motion generation         | KIT-ML (test)    | FID                 | 0.571  | 115
Text-to-motion generation         | HumanML3D (test) | FID                 | 0.243  | 37
Text-conditional motion synthesis | HumanML3D (test) | R-Precision (Top 1) | 49.2   | 15
Text-conditional motion synthesis | HumanML3D (test) | R-Precision (Top 1) | 0.492  | 15
Text-to-motion generation         | KIT-ML (test)    | R-Precision (Top 1) | 0.418  | 11
