GaussianTalker: Real-Time High-Fidelity Talking Head Synthesis with Audio-Driven 3D Gaussian Splatting

About

We propose GaussianTalker, a novel framework for real-time generation of pose-controllable talking heads. It leverages the fast rendering capabilities of 3D Gaussian Splatting (3DGS) while addressing the challenges of directly controlling 3DGS with speech audio. GaussianTalker constructs a canonical 3DGS representation of the head and deforms it in sync with the audio. A key insight is to encode the 3D Gaussian attributes into a shared implicit feature representation, where they are merged with audio features to manipulate each Gaussian attribute. This design exploits spatial-aware features and enforces interactions between neighboring points. The feature embeddings are then fed to a spatial-audio attention module, which predicts frame-wise offsets for the attributes of each Gaussian. This approach is more stable than previous concatenation or multiplication approaches for manipulating the numerous Gaussians and their intricate parameters. Experimental results showcase GaussianTalker's superiority in facial fidelity, lip synchronization accuracy, and rendering speed compared to previous methods. Specifically, GaussianTalker achieves rendering speeds of up to 120 FPS, surpassing previous benchmarks. Our code is made available at https://github.com/KU-CVLAB/GaussianTalker/.
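The attention step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single-head cross-attention in which per-Gaussian features act as queries over audio features, and the names `predict_offsets` and `deform`, the weight matrices, the feature sizes, and the 10-value offset layout (3 position, 3 scale, 4 rotation) are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def predict_offsets(gauss_feats, audio_feats, w_q, w_k, w_v, w_out):
    """Toy cross-attention (hypothetical, not the paper's exact module).

    gauss_feats: (N, D)   one feature vector per Gaussian
    audio_feats: (T, D_a) audio feature tokens for the current frame
    Returns per-Gaussian offsets (N, P) and attention weights (N, T).
    """
    q = gauss_feats @ w_q                      # (N, H) queries from Gaussians
    k = audio_feats @ w_k                      # (T, H) keys from audio
    v = audio_feats @ w_v                      # (T, H) values from audio
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, T)
    offsets = (attn @ v) @ w_out               # (N, P)
    return offsets, attn


def deform(canonical, offsets):
    """Add offsets to canonical attributes.

    Assumed layout of the offset vector: 3 position, 3 scale, 4 rotation.
    """
    return {
        "position": canonical["position"] + offsets[:, 0:3],
        "scale": canonical["scale"] + offsets[:, 3:6],
        "rotation": canonical["rotation"] + offsets[:, 6:10],
    }


# Toy sizes and random values, chosen only for illustration.
rng = np.random.default_rng(0)
N, T, D, D_a, H, P = 5, 4, 8, 6, 16, 10
gauss_feats = rng.normal(size=(N, D))
audio_feats = rng.normal(size=(T, D_a))
w_q = rng.normal(size=(D, H))
w_k = rng.normal(size=(D_a, H))
w_v = rng.normal(size=(D_a, H))
w_out = rng.normal(size=(H, P)) * 0.01

canonical = {
    "position": rng.normal(size=(N, 3)),
    "scale": np.ones((N, 3)),
    "rotation": np.tile([1.0, 0.0, 0.0, 0.0], (N, 1)),  # identity quaternions
}

offsets, attn = predict_offsets(gauss_feats, audio_feats, w_q, w_k, w_v, w_out)
deformed = deform(canonical, offsets)
```

In this sketch the canonical attributes stay fixed and only the predicted offsets change per frame, which mirrors the canonical-plus-deformation structure the abstract describes. A real system would learn the weights and would also need to keep rotations normalized, which is omitted here.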

Kyusun Cho, Joungbin Lee, Heeji Yoon, Yeobin Hong, Jaehoon Ko, Sangjun Ahn, Seungryong Kim • 2024

Related benchmarks

Task	Dataset	Result
Audio-driven facial animation	MEAD 41 (test)	PSNR 28.911	26
Audio-driven facial animation	RAVDESS 42 (test)	PSNR 28.516	24
Talking Head Reenactment	General Inference (test)	FPS 76.802	13
Talking Head Reenactment	General Inference	Inference Speed (FPS) 76.802	13
Personalized 3D Talking Face Generation	HDTF	PSNR 29.82	12
3D Talking Face Generation	HDTF	NIQE 24.351	12
Talking head synthesis	May avatar Shaheen audio	Sync-D 8.926	10
Talking head synthesis	May avatar Lieu audio	Sync-D 10.943	10
Talking Face Generation	User Study (test)	Lip-sync Accuracy 5.75	8
Talking head synthesis	Portrait Video Self-reconstruction (test)	PSNR 32.69	8

Showing 10 of 13 rows
