GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis
About
Generating photo-realistic video portraits driven by arbitrary speech audio is a crucial problem for film-making and virtual reality. Several recent works explore neural radiance fields (NeRF) for this task to improve 3D realness and image fidelity. However, the generalizability of previous NeRF-based methods to out-of-domain audio is limited by the small scale of the training data. In this work, we propose GeneFace, a generalized and high-fidelity NeRF-based talking face generation method that produces natural results for a wide range of out-of-domain audio. Specifically, we learn a variational motion generator on a large lip-reading corpus and introduce a domain-adaptive post-net to calibrate its predictions. We then learn a NeRF-based renderer conditioned on the predicted facial motion, and propose a head-aware torso-NeRF to eliminate the head-torso separation problem. Extensive experiments show that our method achieves more generalized and higher-fidelity talking face generation than previous methods.
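The three-stage pipeline described above (audio-to-motion generation, domain-adaptive calibration, motion-conditioned NeRF rendering) can be sketched as follows. This is a minimal illustrative stub, not the authors' implementation: all function names are hypothetical, and each stage is replaced by a shape-preserving placeholder so the data flow is visible.

```python
# Hypothetical sketch of the GeneFace inference pipeline.
# All names and shapes are illustrative assumptions, not the released API.

def variational_motion_generator(audio_features):
    """Map per-frame audio features to 3D facial landmarks (stub: 68 points x 3)."""
    return [[(0.0, 0.0, 0.0)] * 68 for _ in audio_features]

def domain_adaptive_postnet(motion):
    """Calibrate the predicted motion toward the target speaker's domain (stub: identity)."""
    return motion

def nerf_renderer(motion_frame):
    """Render one head frame conditioned on facial motion (stub: returns the image shape)."""
    return (512, 512, 3)  # H, W, C of the rendered portrait

def geneface_infer(audio_features):
    # Stage 1: audio -> facial motion, learned on a large lip-reading corpus.
    motion = variational_motion_generator(audio_features)
    # Stage 2: calibrate motion into the target person's domain.
    motion = domain_adaptive_postnet(motion)
    # Stage 3: render each frame with the motion-conditioned NeRF.
    return [nerf_renderer(frame) for frame in motion]

frames = geneface_infer(audio_features=[[0.0] * 29] * 10)  # 10 audio frames
```

A real implementation would replace the stubs with the trained VAE motion generator, the adversarial post-net, and volumetric NeRF rendering, but the staged data flow stays the same.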
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Talking head synthesis | User Study | Lip Sync Quality | 2.982 | 18 |
| Head reconstruction | Video sequences (test) | PSNR | 24.8165 | 11 |
| Talking Head Reconstruction | Talking Head Reconstruction (test) | PSNR | 24.82 | 9 |
| Lip synchronization | Cross-subject Lip Synchronization (Audio A) | LSE-D | 9.545 | 8 |
| Lip synchronization | Cross-subject Lip Synchronization (Audio B) | LSE-D | 9.668 | 8 |
| Audio-driven 3D Talking Head Generation | VASA generated synthetic video 1.0 (test) | Sc | 5.922 | 6 |
| Lip synchronization | Well-edited video sequences (Audio A) | LSE-D | 9.5451 | 6 |
| Lip synchronization | Well-edited video sequences (Audio B) | LSE-D | 9.6675 | 6 |
| Talking Face Generation | Macron 512x512 (test) | LPIPS | 0.027 | 5 |
| Talking Head Generation | Obama dataset (test) | CSIM | 0.819 | 5 |