Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation

About

Animating virtual avatars to make co-speech gestures facilitates various applications in human-machine interaction. The existing methods mainly rely on generative adversarial networks (GANs), which typically suffer from notorious mode collapse and unstable training, thus making it difficult to learn accurate audio-gesture joint distributions. In this work, we propose a novel diffusion-based framework, named Diffusion Co-Speech Gesture (DiffGesture), to effectively capture the cross-modal audio-to-gesture associations and preserve temporal coherence for high-fidelity audio-driven co-speech gesture generation. Specifically, we first establish the diffusion-conditional generation process on clips of skeleton sequences and audio to enable the whole framework. Then, a novel Diffusion Audio-Gesture Transformer is devised to better attend to the information from multiple modalities and model the long-term temporal dependency. Moreover, to eliminate temporal inconsistency, we propose an effective Diffusion Gesture Stabilizer with an annealed noise sampling strategy. Benefiting from the architectural advantages of diffusion models, we further incorporate implicit classifier-free guidance to trade off between diversity and gesture quality. Extensive experiments demonstrate that DiffGesture achieves state-of-the-art performance, rendering coherent gestures with better mode coverage and stronger audio correlations. Code is available at https://github.com/Advocate99/DiffGesture.
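The diversity/quality trade-off attributed to classifier-free guidance can be illustrated with a minimal sketch. This is the generic guidance formula, not necessarily the exact "implicit" variant used in DiffGesture; the function name, the flat-list representation of noise predictions, and the convention that a scale of 1.0 recovers the purely audio-conditioned prediction are all assumptions made for illustration.

```python
from typing import List, Sequence


def classifier_free_guidance(
    eps_cond: Sequence[float],
    eps_uncond: Sequence[float],
    guidance_scale: float,
) -> List[float]:
    """Blend conditional and unconditional noise predictions.

    Hypothetical helper (not from the DiffGesture code base).
    guidance_scale = 0.0 -> unconditional prediction only
    guidance_scale = 1.0 -> conditional prediction only
    guidance_scale > 1.0 -> extrapolates toward the condition,
                            usually trading diversity for fidelity
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Toy noise predictions for three pose coordinates (made-up numbers).
eps_audio = [0.5, -1.0, 2.0]   # denoiser output given the audio condition
eps_null = [0.0, 0.0, 1.0]     # denoiser output with the condition dropped
guided = classifier_free_guidance(eps_audio, eps_null, 2.0)
print(guided)  # [1.0, -2.0, 3.0]
```

In a real sampler the two predictions would come from the same denoising network, evaluated once with the audio features and once with them masked out, and the blended prediction would replace the plain conditional one at every reverse-diffusion step.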

Lingting Zhu, Xian Liu, Xuanyu Liu, Rui Qian, Ziwei Liu, Lequan Yu • 2023

Related benchmarks

Task	Dataset	Result
3D co-speech gesture generation	TED-ETrans (test)	FGD_h+t 18.69	14
3D co-speech gesture generation	BEAT-ETrans (test)	FGD (h+t) 6.68	14
Speech-driven Holistic Expression and Gesture Generation	BEAT 2022 (test)	FMD 1.21e+4	9
Co-speech gesture generation	TED Gesture & TED Expressive User Study (test)	Naturalness 4	7
Co-speech gesture generation	TED Gesture	FGD 1.506	7
Co-speech gesture generation	TED Expressive	FGD 2.6	7
Speech-driven gesture generation	BEAT (test)	--	7
Co-speech gesture generation	Human Evaluation User Study	Naturalness 3.56	6
Audio-conditioned full-body animation	OmniHuman Single-person audio conditioned human animation (test)	Sync-C 0.496	6
Monadic Co-speech Gesture Synthesis	BEAT 47 (test)	BeatAlign 96	5

Showing 10 of 12 rows
