EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis

About

Achieving disentangled control over multiple facial motions and accommodating diverse input modalities greatly enhances the applicability and entertainment value of talking head generation. This necessitates a deep exploration of the decoupling space for facial features, ensuring that they a) operate independently without mutual interference and b) can be preserved and shared across input modalities, two aspects often neglected in existing methods. To address this gap, this paper proposes a novel Efficient Disentanglement framework for Talking head generation (EDTalk). Our framework enables individual manipulation of mouth shape, head pose, and emotional expression, conditioned on video or audio inputs. Specifically, we employ three lightweight modules to decompose the facial dynamics into three distinct latent spaces representing mouth, pose, and expression, respectively. Each space is characterized by a set of learnable bases whose linear combinations define specific motions. To ensure independence and accelerate training, we enforce orthogonality among bases and devise an efficient training strategy to allocate motion responsibilities to each space without relying on external knowledge. The learned bases are then stored in corresponding banks, enabling shared visual priors with audio input. Furthermore, considering the properties of each space, we propose an Audio-to-Motion module for audio-driven talking head synthesis. Experiments demonstrate the effectiveness of EDTalk. We recommend visiting the project website: https://tanshuai0219.github.io/EDTalk/
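
The idea of bases whose linear combinations define a motion, with orthogonality enforced among them, can be illustrated with a small sketch. This is not the authors' implementation: the function names (`combine`, `orthogonality_penalty`), the toy 4-dimensional banks, and the choice of a squared dot-product penalty over all pairs of bases are assumptions made for illustration only.

```python
def combine(bases, coeffs):
    """Return sum_i coeffs[i] * bases[i]: a motion latent built from one bank."""
    dim = len(bases[0])
    return [sum(c * b[k] for c, b in zip(coeffs, bases)) for k in range(dim)]


def orthogonality_penalty(bases):
    """Sum of squared dot products over distinct pairs of bases.

    The value is zero exactly when all bases are pairwise orthogonal, so
    minimizing it pushes the bases toward independence (assumed loss form).
    """
    total = 0.0
    for i in range(len(bases)):
        for j in range(i + 1, len(bases)):
            dot = sum(x * y for x, y in zip(bases[i], bases[j]))
            total += dot * dot
    return total


# Hypothetical toy banks; in practice the bases would be learned.
mouth_bank = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
pose_bank = [[0.0, 0.0, 1.0, 0.0]]
expression_bank = [[0.0, 0.0, 0.0, 1.0]]

# A mouth motion is a weighted sum of the mouth bases only.
mouth_motion = combine(mouth_bank, [0.5, -2.0])

# Penalty over every basis in every bank (assumed scope of the constraint).
all_bases = mouth_bank + pose_bank + expression_bank
penalty = orthogonality_penalty(all_bases)
```

In this toy setup the banks are already orthogonal, so the penalty is zero and changing the mouth coefficients cannot move the latent along the pose or expression directions.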

Shuai Tan, Bin Ji, Mengxiao Bi, Ye Pan • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Audio-driven facial animation | MEAD 41 (test) | PSNR 27.938 | 26 |
| Audio-driven facial animation | RAVDESS 42 (test) | PSNR 26.466 | 24 |
| Audio Driven Talking Head Generation | Mead | Sync 8.057 | 14 |
| Audio Driven Talking Head Generation | CREMA | Sync 6.3703 | 14 |
| Talking Head Reenactment | General Inference (test) | FPS 16.878 | 13 |
| Talking Head Reenactment | General Inference | Inference Speed (FPS) 16.878 | 13 |
| Talking Head Generation | VOCASET and HDTF Cross-Reenactment | Sync 6.982 | 7 |
| Talking Head Generation | VOCASET and HDTF Self-Reenactment | PSNR 26.9461 | 7 |
| 3D Talking Face Synthesis | Regular talking face dataset Obama and May (test) | Sync-C 6.173 | 6 |
| 3D Talking Face Synthesis | Emotional talking face dataset (MEAD) M003 and M030 (test) | Sync-C 7.55 | 6 |