SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis

About

Achieving high synchronization in the synthesis of realistic, speech-driven talking head videos presents a significant challenge. Traditional Generative Adversarial Networks (GAN) struggle to maintain consistent facial identity, while Neural Radiance Fields (NeRF) methods, although they can address this issue, often produce mismatched lip movements, inadequate facial expressions, and unstable head poses. A lifelike talking head requires synchronized coordination of subject identity, lip movements, facial expressions, and head poses. The absence of these synchronizations is a fundamental flaw, leading to unrealistic and artificial outcomes. To address the critical issue of synchronization, identified as the "devil" in creating realistic talking heads, we introduce SyncTalk. This NeRF-based method effectively maintains subject identity, enhancing synchronization and realism in talking head synthesis. SyncTalk employs a Face-Sync Controller to align lip movements with speech and innovatively uses a 3D facial blendshape model to capture accurate facial expressions. Our Head-Sync Stabilizer optimizes head poses, achieving more natural head movements. The Portrait-Sync Generator restores hair details and blends the generated head with the torso for a seamless visual experience. Extensive experiments and user studies demonstrate that SyncTalk outperforms state-of-the-art methods in synchronization and realism. We recommend watching the supplementary video: https://ziqiaopeng.github.io/synctalk

Ziqiao Peng, Wentao Hu, Yue Shi, Xiangyu Zhu, Xiaomei Zhang, Hao Zhao, Jun He, Hongyan Liu, Zhaoxin Fan • 2023

Related benchmarks

Task	Dataset	Metric	Result	
Audio-driven facial animation	MEAD 41 (test)	PSNR	28.12	26
Audio-driven facial animation	RAVDESS 42 (test)	PSNR	27.706	24
Talking head synthesis	User Study	Lip Sync Quality	4.304	18
Talking Face Generation	HDTF (test)	SSIM	0.73	16
Talking Head Reenactment	General Inference (test)	FPS	1.03	13
Talking Head Reenactment	General Inference	Inference Speed (FPS)	1.03	13
Head reconstruction	Video sequences (test)	PSNR	37.4017	11
Talking Head Generation	User Study	Lip Sync	85.2	11
Talking head synthesis	May avatar Lieu audio	Sync-D	7.508	10
Talking head synthesis	May avatar Shaheen audio	Sync-D	8.903	10

Showing 10 of 16 rows
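
Several rows above report PSNR (peak signal-to-noise ratio, in dB, where higher means the generated frame is closer to the reference). The following is a minimal sketch of how that metric is commonly computed, assuming 8-bit pixel values (peak 255) passed as flat sequences. The function name `psnr` and its parameters are illustrative assumptions, and this is not the evaluation code used for the results listed here.

```python
import math


def psnr(reference, generated, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    Assumption: pixels are flat sequences of numbers in [0, peak].
    """
    if len(reference) != len(generated) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    # Hypothetical toy data: every pixel is off by 10 intensity levels.
    ref = [100, 100, 100, 100]
    gen = [110, 110, 110, 110]
    print(f"PSNR: {psnr(ref, gen):.2f} dB")
```

Because PSNR is a logarithm of the inverse mean squared error, small differences in dB between rows can correspond to noticeable differences in per-pixel error.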
