TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting

About

Radiance fields have demonstrated impressive performance in synthesizing lifelike 3D talking heads. However, because steep appearance changes are difficult to fit, the prevailing paradigm, which represents facial motion by directly modifying point appearance, may lead to distortions in dynamic regions. To tackle this challenge, we introduce TalkingGaussian, a deformation-based radiance field framework for high-fidelity talking head synthesis. Leveraging point-based Gaussian Splatting, our method represents facial motions by applying smooth and continuous deformations to persistent Gaussian primitives, without having to learn the difficult appearance changes that previous methods must fit. Thanks to this simplification, precise facial motions can be synthesized while keeping facial features highly intact. Under such a deformation paradigm, we further identify a face-mouth motion inconsistency that would affect the learning of detailed speaking motions. To address this conflict, we decompose the model into two separate branches for the face and inside-mouth areas, thereby simplifying the learning tasks and helping to reconstruct more accurate motion and structure of the mouth region. Extensive experiments demonstrate that our method renders high-quality lip-synchronized talking head videos, with better facial fidelity and higher efficiency than previous methods.
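The deformation idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The names `make_branch`, `deform`, `FEAT_DIM`, and `max_offset` are hypothetical, the single linear layer with `tanh` is a stand-in for a learned deformation network, only the Gaussian centers are deformed, and merging the two branches by concatenation is an assumption made for illustration. What it does show is the core point: each frame moves the same persistent primitives by small, smooth offsets while their appearance attributes stay fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM = 8  # hypothetical size of a per-frame audio feature


def make_branch(num_points, feat_dim):
    """Create one branch: persistent primitives plus toy deformation weights."""
    return {
        "positions": rng.normal(size=(num_points, 3)),  # canonical centers
        "colors": rng.uniform(size=(num_points, 3)),    # fixed appearance
        # Stand-in for a learned deformation network (assumption).
        "weights": rng.normal(scale=0.1, size=(3 + feat_dim, 3)),
    }


def deform(branch, audio_feature, max_offset=0.01):
    """Return per-frame centers; the canonical attributes are not modified."""
    n = branch["positions"].shape[0]
    cond = np.tile(audio_feature, (n, 1))                 # (n, feat_dim)
    inp = np.concatenate([branch["positions"], cond], axis=1)
    offsets = max_offset * np.tanh(inp @ branch["weights"])  # bounded, smooth
    return branch["positions"] + offsets


# Two separate branches, mirroring the face / inside-mouth decomposition.
face = make_branch(100, FEAT_DIM)
mouth = make_branch(20, FEAT_DIM)

audio_feature = rng.normal(size=FEAT_DIM)

# Assumed fusion for illustration: concatenate both branches before rendering.
frame_positions = np.concatenate(
    [deform(face, audio_feature), deform(mouth, audio_feature)]
)
frame_colors = np.concatenate([face["colors"], mouth["colors"]])
```

Because the colors are never rewritten per frame, motion in this sketch comes only from geometry, which is the contrast the abstract draws with appearance-modifying approaches.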

Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xin Ning, Jun Zhou, Lin Gu • 2024

Related benchmarks

Task	Dataset	Result
Personalized 3D Talking Face Generation	HDTF	PSNR 29.59	12
3D Talking Face Generation	HDTF	NIQE 25.128	12
Talking Head Generation	User Study	Lip Sync 144	11
Talking head synthesis	May avatar Shaheen audio	Sync-D 11.45	10
Talking head synthesis	May avatar Lieu audio	Sync-D 7.439	10
Talking Head Reconstruction	Talking Head Reconstruction (test)	PSNR 32.4	9
Audio-visual Synchronization	HDTF cross-driven	Sync-C (Cross-Gender) 5.258	8
Talking Face Generation	User Study (test)	Lip-sync Accuracy 6.12	8
Talking head synthesis	Curated 5-Identity Audio-Visual Dataset (Macron, Paul, Obama, May, Stabenow) (test)	PSNR 31.674	8
Talking head synthesis	Portrait Video Self-reconstruction (test)	PSNR 33.75	8

Showing 10 of 27 rows
