
Narrating For You: Prompt-guided Audio-visual Narrating Face Generation Employing Multi-entangled Latent Space

About

We present a novel approach for generating realistic talking faces by synthesizing a person's voice and facial movements from a static image, a voice profile, and a target text. The model encodes the prompt/driving text, the driving image, and the individual's voice profile, then combines these embeddings in a multi-entangled latent space that forms the key-value pairs and queries for the audio and video generation pipelines. The multi-entangled latent space establishes spatiotemporal, person-specific features across the modalities. The entangled features are then passed to each modality's decoder to generate the output audio and video.
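The paper does not publish the exact fusion code, but the description above (combined condition embeddings feeding key-value pairs and queries) reads as cross-modal attention. A minimal sketch of that idea, with all shapes, names, and the single-head attention layout being illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def entangle(text_emb, image_emb, voice_emb, wq, wk, wv):
    """Fuse the three condition embeddings into one latent sequence,
    then run single-head self-attention over it, so every modality
    attends to every other (a stand-in for the multi-entangled space)."""
    z = np.concatenate([text_emb, image_emb, voice_emb], axis=0)  # (3n, d)
    q, k, v = z @ wq, z @ wk, z @ wv                              # queries/keys/values
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))                # (3n, 3n)
    return attn @ v  # entangled features handed to a modality decoder

rng = np.random.default_rng(0)
d = 16
text = rng.normal(size=(4, d))   # token embeddings of the driving text
image = rng.normal(size=(4, d))  # patch embeddings of the static image
voice = rng.normal(size=(4, d))  # frames of the voice profile
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
feats = entangle(text, image, voice, wq, wk, wv)
print(feats.shape)  # (12, 16)
```

In the actual model, each modality branch would project the shared latent with its own learned weights before decoding; the sketch only shows the fusion step.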

Aashish Chandra, Aashutosh A V, Abhijit Das · 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio-visual synchronization | Audio-visual synchronization benchmark | LSE-C | 5.71 | 7 |
| Audio Generation | FakeAVCeleb | FAD | 171.5 | 5 |
| Audio Generation | CelebV-HQ | FAD | 244.8 | 5 |
| Audio Generation | HDTF | FAD | 106.4 | 5 |
| Audio Quality Evaluation | Audio Evaluation Set | ESTOI | 43 | 5 |
| Talking head synthesis | Talking Head Synthesis Datasets (test) | PSNR | 35.94 | 5 |
| Video Generation | VoxCeleb | FID | 42.88 | 5 |
| Video Generation | CelebV-HQ | FID | 34.01 | 5 |
| Video Generation | HDTF | FID | 11.72 | 5 |
| Audio Generation | VoxCeleb | FAD | 241.8 | 5 |

Showing 10 of 11 rows.
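Most rows above report FID (video) or FAD (audio); both are Fréchet distances between Gaussians fitted to embeddings of real versus generated samples. A minimal sketch of the formula, simplified to diagonal covariances so it needs no matrix square root (the real metrics use full covariances of Inception or VGGish embeddings):

```python
import numpy as np

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Frechet distance between two Gaussians with diagonal covariances:
    ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1*var2)).
    FID/FAD apply the same formula with full covariance matrices."""
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2)))

# Toy "embeddings": generated samples drawn from a slightly shifted,
# slightly wider distribution than the real ones.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(1000, 8))
fake = rng.normal(0.5, 1.2, size=(1000, 8))
fd = frechet_distance_diag(real.mean(0), real.var(0),
                           fake.mean(0), fake.var(0))
print(fd)  # small positive number; 0.0 for identical distributions
```

Lower is better for FID and FAD, so the HDTF FID of 11.72 indicates the closest match to real video among the listed datasets.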
