Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition

About

While dynamic Neural Radiance Fields (NeRF) have shown success in high-fidelity 3D modeling of talking portraits, the slow training and inference speed severely obstruct their potential usage. In this paper, we propose an efficient NeRF-based framework that enables real-time synthesizing of talking portraits and faster convergence by leveraging the recent success of grid-based NeRF. Our key insight is to decompose the inherently high-dimensional talking portrait representation into three low-dimensional feature grids. Specifically, a Decomposed Audio-spatial Encoding Module models the dynamic head with a 3D spatial grid and a 2D audio grid. The torso is handled with another 2D grid in a lightweight Pseudo-3D Deformable Module. Both modules focus on efficiency under the premise of good rendering quality. Extensive experiments demonstrate that our method can generate realistic and audio-lips synchronized talking portrait videos, while also being highly efficient compared to previous methods.

Jiaxiang Tang, Kaisiyuan Wang, Hang Zhou, Xiaokang Chen, Dongliang He, Tianshu Hu, Jingtuo Liu, Gang Zeng, Jingdong Wang• 2022

Related benchmarks

Task	Dataset	Result
3D Talking Face Generation	HDTF	NIQE24.068	12
Personalized 3D Talking Face Generation	HDTF	PSNR28.82	12
Head reconstruction	Video sequences (test)	PSNR31.7754	11
Talking head synthesis	May avatar Shaheen audio	Sync-D12.012	10
Talking head synthesis	May avatar Lieu audio	Sync-D12.044	10
Talking Head Reconstruction	Talking Head Reconstruction (test)	PSNR31.78	9
Lip synchronization	Cross-subject Lip Synchronization (Audio A)	LSE-D11.639	8
Lip synchronization	Cross-subject Lip Synchronization (Audio B)	LSE-D11.082	8
Talking Face Generation	User Study (test)	Lip-sync Accuracy5.53	8
Talking head synthesis	Portrait Video Self-reconstruction (test)	PSNR31.95	8

Showing 10 of 14 rows

Other info

Follow for update

@wizwand_team Discord