Narrating For You: Prompt-guided Audio-visual Narrating Face Generation Employing Multi-entangled Latent Space
About
We present a novel approach for generating realistic talking faces by synthesizing a person's voice and facial movements from a static image, a voice profile, and a target text. The model encodes the prompt/driving text, the driving image, and the individual's voice profile, then combines these embeddings in a multi-entangled latent space, which supplies key-value pairs and queries to the audio and video generation pipelines. This latent space establishes spatiotemporal, person-specific correspondences between the modalities. The entangled features are then passed to each modality's decoder to generate the output audio and video.
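The paper does not publish implementation details, but the description above suggests a cross-attention pattern in which each modality's pipeline queries a shared latent built from the three encoded inputs. The sketch below illustrates that idea only: all names (`entangled_attention`, the embedding shapes, the concatenation of text/image/voice embeddings) are hypothetical assumptions, not the authors' code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def entangled_attention(query, latent, d_k):
    # Keys and values come from the shared entangled latent;
    # queries come from a modality-specific pipeline.
    scores = query @ latent.T / np.sqrt(d_k)   # (n_q, n_latent)
    weights = softmax(scores, axis=-1)
    return weights @ latent                    # (n_q, d_k)

rng = np.random.default_rng(0)
d = 16
text_emb = rng.normal(size=(4, d))   # encoded prompt/driving text (assumed shape)
image_emb = rng.normal(size=(1, d))  # encoded static driving image (assumed shape)
voice_emb = rng.normal(size=(2, d))  # encoded voice profile (assumed shape)

# Hypothetical "multi-entangled latent": a shared pool of entries
# combining all three input embeddings.
latent = np.concatenate([text_emb, image_emb, voice_emb], axis=0)

# Each modality pipeline attends to the shared latent with its own queries.
audio_query = rng.normal(size=(8, d))
video_query = rng.normal(size=(8, d))
audio_features = entangled_attention(audio_query, latent, d)
video_features = entangled_attention(video_query, latent, d)
```

In a real model these features would feed the audio and video decoders; here they simply show both pipelines drawing on the same person-specific latent.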
Aashish Chandra, Aashutosh A V, Abhijit Das • 2026
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio-visual synchronisation | Audio-visual synchronization benchmark | LSE-C | 5.71 | 7 |
| Audio Generation | FakeAVCeleb | FAD | 171.5 | 5 |
| Audio Generation | CelebV-HQ | FAD | 244.8 | 5 |
| Audio Generation | HDTF | FAD | 106.4 | 5 |
| Audio Quality Evaluation | Audio Evaluation Set | ESTOI | 43 | 5 |
| Talking head synthesis | Talking Head Synthesis Datasets (test) | PSNR | 35.94 | 5 |
| Video Generation | VoxCeleb | FID | 42.88 | 5 |
| Video Generation | CelebV-HQ | FID | 34.01 | 5 |
| Video Generation | HDTF | FID | 11.72 | 5 |
| Audio Generation | VoxCeleb | FAD | 241.8 | 5 |