DreamTalk: When Emotional Talking Head Generation Meets Diffusion Probabilistic Models

About

Emotional talking head generation has attracted growing attention. Previous methods, which are mainly GAN-based, still struggle to consistently produce satisfactory results across diverse emotions and cannot conveniently specify personalized emotions. In this work, we leverage powerful diffusion models to address the issue and propose DreamTalk, a framework that employs meticulous design to unlock the potential of diffusion models in generating emotional talking heads. Specifically, DreamTalk consists of three crucial components: a denoising network, a style-aware lip expert, and a style predictor. The diffusion-based denoising network can consistently synthesize high-quality audio-driven face motions across diverse emotions. To enhance lip-motion accuracy and emotional fullness, we introduce a style-aware lip expert that can guide lip-sync while preserving emotion intensity. To more conveniently specify personalized emotions, a diffusion-based style predictor is utilized to predict the personalized emotion directly from the audio, eliminating the need for extra emotion reference. By this means, DreamTalk can consistently generate vivid talking faces across diverse emotions and conveniently specify personalized emotions. Extensive experiments validate DreamTalk's effectiveness and superiority. The code is available at https://github.com/ali-vilab/dreamtalk.

Yifeng Ma, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yingya Zhang, Zhidong Deng • 2023

Related benchmarks

Task	Dataset	Result
Talking Head Generation	HDTF	FID 78.147	48
Audio-driven facial animation	MEAD 41 (test)	PSNR 27.801	26
Audio-driven facial animation	RAVDESS 42 (test)	PSNR 26.193	24
Talking Head Generation	CelebV-HQ	FID 77.78	15
Talking Head Reenactment	General Inference (test)	FPS 7.832	13
Talking Head Reenactment	General Inference	Inference Speed (FPS) 7.832	13
Talking Head Generation	Celeb-V	Sync-C 5.709	9
Talking Face Generation	HDTF one-shot	FID 78.147	7
Emotion-conditioned generation	Mead	E-Score 0.526	5
Talking Head Generation	Proposed Wild Dataset	Sync-C 4.498	5
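
For readers unfamiliar with the PSNR figures in the table, the following is a minimal sketch of how peak signal-to-noise ratio is commonly computed between a reference frame and a generated frame. It is not taken from the DreamTalk code: the function name `psnr`, the 8-bit peak value of 255, and the toy pixel values are assumptions for illustration, and the paper's actual evaluation protocol (frame size, color space, averaging over frames) is not stated on this page.

```python
import math

def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences.

    max_value is the largest possible pixel value (255 assumes 8-bit images).
    """
    if len(reference) != len(generated) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy example: four hypothetical pixel intensities per frame.
reference = [52, 55, 61, 59]
generated = [54, 55, 60, 57]
score = psnr(reference, generated)
print(f"PSNR: {score:.3f} dB")
```

Higher PSNR means the generated frame is closer to the reference in a per-pixel sense. It does not measure lip-sync or emotional accuracy, which is presumably why the table also lists metrics such as Sync-C and E-Score.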
