Narrating For You: Prompt-guided Audio-visual Narrating Face Generation Employing Multi-entangled Latent Space
About
We present a novel approach for generating realistic talking faces by synthesizing a person's voice and facial movements from a static image, a voice profile, and a target text. The model encodes the prompt/driving text, the driving image, and the individual's voice profile, then combines these embeddings in a multi-entangled latent space, which supplies key-value pairs and queries to the audio and video generation pipelines. This latent space establishes spatiotemporal, person-specific correspondences between the modalities. The entangled features are then passed to each modality's decoder to generate the output audio and video.
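The paper does not publish implementation details, but the description above suggests a cross-attention pattern in which each modality's pipeline queries a shared latent built from the three encoded inputs. The sketch below illustrates that idea only: all names (`entangled_attention`, the embedding shapes, the concatenation of text/image/voice embeddings) are hypothetical assumptions, not the authors' code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def entangled_attention(query, latent, d_k):
    # Keys and values come from the shared entangled latent;
    # queries come from a modality-specific pipeline.
    scores = query @ latent.T / np.sqrt(d_k)   # (n_q, n_latent)
    weights = softmax(scores, axis=-1)
    return weights @ latent                    # (n_q, d_k)

rng = np.random.default_rng(0)
d = 16
text_emb = rng.normal(size=(4, d))   # encoded prompt/driving text (assumed shape)
image_emb = rng.normal(size=(1, d))  # encoded static driving image (assumed shape)
voice_emb = rng.normal(size=(2, d))  # encoded voice profile (assumed shape)

# Hypothetical "multi-entangled latent": a shared pool of entries
# combining all three input embeddings.
latent = np.concatenate([text_emb, image_emb, voice_emb], axis=0)

# Each modality pipeline attends to the shared latent with its own queries.
audio_query = rng.normal(size=(8, d))
video_query = rng.normal(size=(8, d))
audio_features = entangled_attention(audio_query, latent, d)
video_features = entangled_attention(video_query, latent, d)
```

In a real model these features would feed the audio and video decoders; here they simply show both pipelines drawing on the same person-specific latent.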
Aashish Chandra, Aashutosh A V, Abhijit Das • 2026
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio-visual synchronisation | Audio-visual synchronization benchmark | LSE-C | 5.71 | 7 |
| Audio Generation | FakeAVCeleb | FAD | 171.5 | 5 |
| Audio Generation | CelebV-HQ | FAD | 244.8 | 5 |
| Audio Generation | HDTF | FAD | 106.4 | 5 |
| Audio Quality Evaluation | Audio Evaluation Set | ESTOI | 43 | 5 |
| Talking head synthesis | Talking Head Synthesis Datasets (test) | PSNR | 35.94 | 5 |
| Video Generation | VoxCeleb | FID | 42.88 | 5 |
| Video Generation | CelebV-HQ | FID | 34.01 | 5 |
| Video Generation | HDTF | FID | 11.72 | 5 |
| Audio Generation | VoxCeleb | FAD | 241.8 | 5 |