Audio-Driven Talking Face Generation with Blink Embedding and Hash Grid Landmarks Encoding

About

Dynamic Neural Radiance Fields (NeRF) have demonstrated considerable success in generating high-fidelity 3D models of talking portraits. Despite significant advances in rendering speed and generation quality, accurately and efficiently capturing mouth movements in talking portraits remains challenging. To address this, we propose an automatic method based on blink embedding and hash grid landmarks encoding, which can substantially enhance the fidelity of talking faces. Specifically, we encode facial features as conditional features and integrate audio features as residual terms into our model through a Dynamic Landmark Transformer. Furthermore, we employ neural radiance fields to model the entire face, resulting in a lifelike face representation. Experimental evaluations validate the superiority of our approach over existing methods.
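To make "hash grid landmarks encoding" more concrete, the following is a minimal sketch of a multiresolution hash grid encoder applied to 2D landmark coordinates, in the style popularized by Instant-NGP. It is not the authors' implementation: the function name `hash_grid_encode`, the table sizes, the number of levels, the 68-landmark layout, and the use of 2D rather than 3D coordinates are all assumptions chosen for illustration.

```python
import numpy as np

# Per-dimension multipliers for spatial hashing (assumed, Instant-NGP style).
PRIMES = np.array([1, 2654435761], dtype=np.uint64)


def hash_grid_encode(points, tables, base_res=4, growth=2.0):
    """Encode 2D points in [0, 1]^2 with a multiresolution hash grid.

    points: (N, 2) float array of normalized landmark coordinates.
    tables: list of (T, F) feature tables, one per resolution level.
    Returns an (N, L * F) array of concatenated per-level features.
    """
    points = np.asarray(points, dtype=np.float64)
    features = []
    for level, table in enumerate(tables):
        res = int(base_res * growth ** level)
        scaled = points * res
        cell = np.floor(scaled).astype(np.int64)  # lower-left grid vertex
        frac = scaled - cell                       # position inside the cell
        out = np.zeros((points.shape[0], table.shape[1]))
        for dx in (0, 1):
            for dy in (0, 1):
                corner = (cell + np.array([dx, dy])).astype(np.uint64)
                # XOR of (coordinate * multiplier), reduced modulo table size.
                idx = (corner[:, 0] * PRIMES[0]) ^ (corner[:, 1] * PRIMES[1])
                idx = (idx % np.uint64(table.shape[0])).astype(np.int64)
                # Bilinear interpolation weight of this corner.
                wx = frac[:, 0] if dx else 1.0 - frac[:, 0]
                wy = frac[:, 1] if dy else 1.0 - frac[:, 1]
                out += (wx * wy)[:, None] * table[idx]
        features.append(out)
    return np.concatenate(features, axis=1)


# Hypothetical usage: 68 landmarks, 4 levels, 2 features per level.
rng = np.random.default_rng(0)
tables = [rng.normal(scale=1e-2, size=(2 ** 10, 2)) for _ in range(4)]
landmarks = rng.random((68, 2))
codes = hash_grid_encode(landmarks, tables)
print(codes.shape)  # (68, 8)
```

In a trained model the feature tables would be learnable parameters rather than random arrays, and the resulting codes would presumably serve as the conditional features mentioned above. How they are combined with the audio residual is not specified on this page, so it is left out of the sketch.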

Yuhui Zhang, Hui Yu, Wei Liang, Sunjie Zhang • 2026

Related benchmarks

Task	Dataset	Result	Rank
Talking Face Generation	Macron 512x512 (test)	LPIPS 0.025	5
