
AI killed the video star. Audio-driven diffusion model for expressive talking head generation

About

We propose Dimitra++, a novel framework for audio-driven talking head generation, designed to jointly learn lip motion, facial expression, and head pose motion. Specifically, we propose a conditional Motion Diffusion Transformer (cMDT) to model facial motion sequences, employing a 3D representation. The cMDT is conditioned on two inputs: a reference facial image, which determines appearance, and an audio sequence, which drives the motion. Quantitative and qualitative experiments, as well as a user study on two widely employed datasets, i.e., VoxCeleb2 and CelebV-HQ, suggest that Dimitra++ outperforms existing approaches in generating realistic talking heads with convincing lip motion, facial expression, and head pose.
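The conditioning scheme described above — a denoiser over a 3D motion sequence, conditioned on a fixed reference-image embedding (appearance) and per-frame audio features (motion) — can be sketched as a single reverse-diffusion step. This is a minimal illustrative sketch, not the paper's implementation: the dimensions, the linear "denoiser" standing in for the cMDT, and the noise schedule are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- not taken from the paper.
T, D = 16, 8          # motion sequence length, 3D-motion feature dim
C = 8                 # conditioning embedding dim

def denoise_step(x_t, audio_emb, ref_emb, t, W):
    """One sketched reverse-diffusion step of a conditional motion model.

    x_t:       (T, D) noisy 3D facial-motion sequence at diffusion step t
    audio_emb: (T, C) per-frame audio features that drive the motion
    ref_emb:   (C,)   reference-image embedding fixing appearance/identity
    W:         (D + 2*C, D) toy linear map standing in for the cMDT
    """
    # The reference embedding is constant over time: broadcast it to every
    # frame, then concatenate motion, audio, and reference features.
    ref_tiled = np.tile(ref_emb, (x_t.shape[0], 1))
    h = np.concatenate([x_t, audio_emb, ref_tiled], axis=-1)
    eps_hat = h @ W                       # predicted noise, shape (T, D)
    alpha = 1.0 - 0.02 * t                # toy schedule, illustrative only
    return (x_t - (1.0 - alpha) * eps_hat) / np.sqrt(alpha)

x = rng.normal(size=(T, D))
audio = rng.normal(size=(T, C))
ref = rng.normal(size=(C,))
W = rng.normal(size=(D + 2 * C, D)) * 0.01

x_prev = denoise_step(x, audio, ref, t=10, W=W)
print(x_prev.shape)  # (16, 8)
```

Iterating such a step from pure noise down to t = 0, with a renderer mapping the final motion sequence plus the reference image to video frames, gives the overall generation loop the abstract describes.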

Baptiste Chopin, Tashvik Dhamija, Pranav Balaji, Yaohui Wang, Antitza Dantcheva • 2025

Related benchmarks

| Task                    | Dataset   | Result                            | Rank |
|-------------------------|-----------|-----------------------------------|------|
| Talking Head Generation | CelebV-HQ | AHD: 12.37                        | 9    |
| Talking Head Generation | VoxCeleb2 | User Preference (Dimitra++): 93.3 | 6    |
