Identity-Preserving Talking Face Generation with Landmark and Appearance Priors

About

Generating talking face videos from audio attracts lots of research interest. A few person-specific methods can generate vivid videos but require the target speaker's videos for training or fine-tuning. Existing person-generic methods have difficulty in generating realistic and lip-synced videos while preserving identity information. To tackle this problem, we propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures. First, we devise a novel Transformer-based landmark generator to infer lip and jaw landmarks from the audio. Prior landmark characteristics of the speaker's face are employed to make the generated landmarks coincide with the facial outline of the speaker. Then, a video rendering model is built to translate the generated landmarks into face images. During this stage, prior appearance information is extracted from the lower-half occluded target face and static reference images, which helps generate realistic and identity-preserving visual content. For effectively exploring the prior information of static reference images, we align static reference images with the target face's pose and expression based on motion fields. Moreover, auditory features are reused to guarantee that the generated face images are well synchronized with the audio. Extensive experiments demonstrate that our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
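The two-stage data flow described above (audio to lip and jaw landmarks, then landmarks plus appearance priors to frames) can be sketched roughly as follows. This is only an illustrative skeleton, not the authors' implementation: every function name, array shape, and constant here is a hypothetical placeholder, the Transformer generator and the rendering network are replaced by trivial NumPy stand-ins, and the reference images are assumed to be already aligned to the target pose (the motion-field alignment step is omitted).

```python
import numpy as np

N_MOUTH = 20  # hypothetical number of lip and jaw landmark points
IMG = 64      # hypothetical square frame size in pixels


def audio_to_landmarks(audio_feat, prior_landmarks):
    """Stage 1 stand-in: the paper uses a Transformer-based generator.

    Here the speaker's prior lip/jaw landmarks are simply shifted by a
    small audio-dependent offset so the data flow can be exercised.
    """
    offset = 0.01 * np.tanh(audio_feat.mean())
    return prior_landmarks + offset


def occlude_lower_half(frame):
    """Zero out the lower half of the target face, keeping the upper half."""
    out = frame.copy()
    out[frame.shape[0] // 2:] = 0.0
    return out


def render_frame(landmarks, occluded_target, aligned_refs, audio_feat):
    """Stage 2 stand-in: the paper uses a learned rendering model.

    This placeholder ignores `landmarks` and `audio_feat` and fills the
    occluded lower half with the mean of the (assumed pre-aligned)
    reference images, only to show which inputs the renderer receives.
    """
    frame = occluded_target.copy()
    half = frame.shape[0] // 2
    frame[half:] = aligned_refs.mean(axis=0)[half:]
    return frame


def talking_face(audio_feats, target_frames, prior_landmarks, refs):
    """Run both stages frame by frame and stack the rendered frames."""
    frames = []
    for audio_feat, target in zip(audio_feats, target_frames):
        landmarks = audio_to_landmarks(audio_feat, prior_landmarks)
        occluded = occlude_lower_half(target)
        frames.append(render_frame(landmarks, occluded, refs, audio_feat))
    return np.stack(frames)
```

In this sketch the renderer receives the same four inputs the abstract lists: generated landmarks, the lower-half occluded target face, reference images, and the audio features that are reused for synchronization.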

Weizhi Zhong, Chaowei Fang, Yinqi Cai, Pengxu Wei, Gangming Zhao, Liang Lin, Guanbin Li • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Talking Face Generation | LRS2 (test) | SSIM | 0.9399 | 18 |
| Visual Dubbing | ContextDubBench 1.0 (test) | FID | 14.891 | 18 |
| Talking head synthesis | User Study | Lip Sync Quality | 3.161 | 18 |
| Head reconstruction | Video sequences (test) | PSNR | 35.1525 | 11 |
| Audio Driven Talking Head Generation | HDTF 51 (test) | SSIM | 0.874 | 9 |
| Visual Dubbing | User Study | Realism | 2.74 | 9 |
| Visual Dubbing | HDTF (test) | PSNR | 28.571 | 9 |
| Cross-Audio Talking Head Generation | HDTF, CelebV-HQ, and CelebV-Text 100 cross-audio pairs | FID | 9.05 | 8 |
| Talking Head Reconstruction | HDTF, CelebV-HQ, and CelebV-Text 100 randomly sampled reconstruction videos | FID | 7.91 | 8 |
| Lip-audio synchronization | HDTF, CelebV-HQ, and CelebV-Text | FPS | 4.24 | 8 |
Showing 10 of 20 rows
