
FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis

About

Creating a realistic animatable avatar from a single static portrait remains challenging. Existing approaches often struggle to capture subtle facial expressions, the associated global body movements, and the dynamic background. To address these limitations, we propose a novel framework that leverages a pretrained video diffusion transformer model to generate high-fidelity, coherent talking portraits with controllable motion dynamics. At the core of our work is a dual-stage audio-visual alignment strategy. In the first stage, we employ a clip-level training scheme to establish coherent global motion by aligning audio-driven dynamics across the entire scene, including the reference portrait, contextual objects, and background. In the second stage, we refine lip movements at the frame level using a lip-tracing mask, ensuring precise synchronization with audio signals. To preserve identity without compromising motion flexibility, we replace the commonly used reference network with a facial-focused cross-attention module that effectively maintains facial consistency throughout the video. Furthermore, we integrate a motion intensity modulation module that explicitly controls expression and body motion intensity, enabling controllable manipulation of portrait movements beyond mere lip motion. Extensive experimental results show that our proposed approach achieves higher quality with better realism, coherence, motion intensity, and identity preservation. Our project page: https://fantasy-amap.github.io/fantasy-talking/.
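
To make the dual-stage idea more concrete, the sketch below shows one plausible way a lip mask could enter a training objective: a full-frame error term (the clip-level, whole-scene part) combined with an error term restricted to the masked lip region (the frame-level part). This is an illustration under stated assumptions, not the formulation used in the paper. The function names `masked_mse` and `combined_loss`, the `lip_weight` parameter, and the use of a plain mean squared error are all hypothetical.

```python
def masked_mse(pred, target, mask, eps=1e-8):
    """Mean squared error weighted by a mask.

    pred, target, mask: lists of frames, each frame a list of rows of floats.
    mask values lie in [0, 1]; 1 marks pixels that count fully
    (for example a hypothetical lip region).
    """
    weighted_error = 0.0
    total_weight = 0.0
    for p_frame, t_frame, m_frame in zip(pred, target, mask):
        for p_row, t_row, m_row in zip(p_frame, t_frame, m_frame):
            for p, t, m in zip(p_row, t_row, m_row):
                weighted_error += m * (p - t) ** 2
                total_weight += m
    # eps avoids division by zero when the mask is empty.
    return weighted_error / (total_weight + eps)


def combined_loss(pred, target, lip_mask, lip_weight=0.5):
    """Blend a whole-scene term with a lip-region term.

    lip_weight is a hypothetical mixing factor, not a value from the paper.
    """
    # An all-ones mask makes the first term cover the entire frame.
    ones = [[[1.0 for _ in row] for row in frame] for frame in lip_mask]
    global_term = masked_mse(pred, target, ones)
    lip_term = masked_mse(pred, target, lip_mask)
    return (1.0 - lip_weight) * global_term + lip_weight * lip_term


# One 2x2 frame with a single wrong pixel that falls inside the lip mask.
pred = [[[1.0, 0.0], [0.0, 0.0]]]
target = [[[0.0, 0.0], [0.0, 0.0]]]
lip_mask = [[[1.0, 0.0], [0.0, 0.0]]]
loss = combined_loss(pred, target, lip_mask, lip_weight=0.5)
```

In this toy case the error sits entirely inside the mask, so the lip term is much larger than the full-frame term. That is the intended effect of such a mask: mistakes around the mouth are penalized more heavily than they would be by a uniform average.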

Mengchao Wang, Qiang Wang, Fan Jiang, Yaqi Fan, Yunpeng Zhang, Yonggang Qi, Kun Zhao, Mu Xu • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Talking Head Generation | HDTF (test) | FID 25.615 | 49 |
| Talking Head Generation | HDTF | FID 16.489 | 33 |
| Talking Avatar Generation | CelebV-HQ (clips) | FID 43.14 | 10 |
| Talking Avatar Generation | long-form videos (test) | FID 144.7 | 10 |
| Talking head video generation | Action Bench (test) | Sync-C 4.209 | 9 |
| Talking Head Generation | Mead | FID 46.617 | 8 |
| Audio-driven video generation | HDTF | FID 24.03 | 8 |
| Audio-driven video generation | Mead | FID 45.24 | 8 |
| Talking Head Generation | Foundation capability evaluation set | IQA 4.01 | 7 |
| Human Animation | Human Animation Evaluation Set | Sync-C 4.05 | 6 |
Showing 10 of 16 rows
