
IM-Portrait: Learning 3D-aware Video Diffusion for Photorealistic Talking Heads from Monocular Videos

About

We propose a novel 3D-aware diffusion-based method for generating photorealistic talking head videos directly from a single identity image and explicit control signals (e.g., expressions). Our method generates Multiplane Images (MPIs) that ensure geometric consistency, making them ideal for immersive viewing experiences like binocular videos for VR headsets. Unlike existing methods that often require a separate stage or joint optimization to reconstruct a 3D representation (such as NeRF or 3D Gaussians), our approach directly generates the final output through a single denoising process, eliminating the need for post-processing steps to render novel views efficiently. To effectively learn from monocular videos, we introduce a training mechanism that reconstructs the output MPI randomly in either the target or the reference camera space. This approach enables the model to simultaneously learn sharp image details and underlying 3D information. Extensive experiments demonstrate the effectiveness of our method, which achieves competitive avatar quality and novel-view rendering capabilities, even without explicit 3D reconstruction or high-quality multi-view training data.
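The Multiplane Image (MPI) representation mentioned above renders a view by alpha-compositing a stack of fronto-parallel RGBA planes from back to front with the "over" operator. As a minimal sketch of that compositing step (the function name, array layout, and plane ordering here are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def composite_mpi(planes):
    """Composite MPI planes back-to-front with the 'over' operator.

    planes: array of shape (D, H, W, 4), ordered far-to-near,
            with RGB and alpha values in [0, 1].
    Returns an (H, W, 3) RGB image.
    """
    out = np.zeros(planes.shape[1:3] + (3,))
    for rgba in planes:  # iterate far-to-near
        rgb, alpha = rgba[..., :3], rgba[..., 3:4]
        # 'over' compositing: new layer occludes what is behind it
        out = rgb * alpha + out * (1.0 - alpha)
    return out
```

Rendering a novel view would additionally warp each plane into the target camera via a per-plane homography before compositing; because the planes are an explicit 3D representation, this warp-and-composite step is cheap, which is what makes MPIs attractive for binocular VR rendering.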

Yuan Li, Ziqian Bai, Feitong Tan, Zhaopeng Cui, Sean Fanello, Yinda Zhang • 2025

Related benchmarks

Task | Dataset | Result | Rank
Video-driven Talking Head Generation (Self-Reenactment) | HDTF | FID 18.12 | 12
Talking head video generation | HDTF | FID 14.76 | 8
Talking head video generation | Talkinghead1kh | FID 27.83 | 8
Cross-identity reenactment | HDTF | FVD 107.9 | 6
Talking head synthesis | VFHQ (first 100 frames) | FID 33.1 | 6
Talking head synthesis | Self-Collected Dataset (50 identities) | FID 36.98 | 6
Talking head synthesis | HDTF | PSNR 24.83 | 5
Talking head synthesis | Talkinghead1kh | PSNR 22.43 | 5
