GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis
About
Generating photo-realistic video portraits driven by arbitrary speech audio is a crucial problem for film-making and virtual reality. Several recent works explore neural radiance fields (NeRF) for this task to improve 3D realness and image fidelity. However, the generalizability of previous NeRF-based methods to out-of-domain audio is limited by the small scale of the training data. In this work, we propose GeneFace, a generalized and high-fidelity NeRF-based talking face generation method that produces natural results for a wide range of out-of-domain audio. Specifically, we learn a variational motion generator on a large lip-reading corpus and introduce a domain-adaptive post-net to calibrate its predictions. We then learn a NeRF-based renderer conditioned on the predicted facial motion, and propose a head-aware torso-NeRF to eliminate the head-torso separation problem. Extensive experiments show that our method achieves more generalized and higher-fidelity talking face generation than previous methods.
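The three-stage pipeline described above (audio-to-motion generation, domain-adaptive calibration, motion-conditioned NeRF rendering) can be sketched as follows. This is a minimal illustrative stub, not the authors' implementation: all function names are hypothetical, and each stage is replaced by a shape-preserving placeholder so the data flow is visible.

```python
# Hypothetical sketch of the GeneFace inference pipeline.
# All names and shapes are illustrative assumptions, not the released API.

def variational_motion_generator(audio_features):
    """Map per-frame audio features to 3D facial landmarks (stub: 68 points x 3)."""
    return [[(0.0, 0.0, 0.0)] * 68 for _ in audio_features]

def domain_adaptive_postnet(motion):
    """Calibrate the predicted motion toward the target speaker's domain (stub: identity)."""
    return motion

def nerf_renderer(motion_frame):
    """Render one head frame conditioned on facial motion (stub: returns the image shape)."""
    return (512, 512, 3)  # H, W, C of the rendered portrait

def geneface_infer(audio_features):
    # Stage 1: audio -> facial motion, learned on a large lip-reading corpus.
    motion = variational_motion_generator(audio_features)
    # Stage 2: calibrate motion into the target person's domain.
    motion = domain_adaptive_postnet(motion)
    # Stage 3: render each frame with the motion-conditioned NeRF.
    return [nerf_renderer(frame) for frame in motion]

frames = geneface_infer(audio_features=[[0.0] * 29] * 10)  # 10 audio frames
```

A real implementation would replace the stubs with the trained VAE motion generator, the adversarial post-net, and volumetric NeRF rendering, but the staged data flow stays the same.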
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Talking head synthesis | User Study | Lip Sync Quality | 2.982 | 18 |
| Head reconstruction | Video sequences (test) | PSNR | 24.8165 | 11 |
| Talking Head Reconstruction | Talking Head Reconstruction (test) | PSNR | 24.82 | 9 |
| Lip synchronization | Cross-subject Lip Synchronization (Audio A) | LSE-D | 9.545 | 8 |
| Lip synchronization | Cross-subject Lip Synchronization (Audio B) | LSE-D | 9.668 | 8 |
| Audio-driven 3D Talking Head Generation | VASA generated synthetic video 1.0 (test) | Sc | 5.922 | 6 |
| Lip synchronization | Well-edited video sequences (Audio A) | LSE-D | 9.5451 | 6 |
| Lip synchronization | Well-edited video sequences (Audio B) | LSE-D | 9.6675 | 6 |
| Talking Face Generation | Macron 512x512 (test) | LPIPS | 0.027 | 5 |
| Talking Head Generation | Obama dataset (test) | CSIM | 0.819 | 5 |