
HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters

About

Recent years have witnessed significant progress in audio-driven human animation. However, critical challenges remain in (i) generating highly dynamic videos while preserving character consistency, (ii) achieving precise emotion alignment between characters and audio, and (iii) enabling multi-character audio-driven animation. To address these challenges, we propose HunyuanVideo-Avatar, a multimodal diffusion transformer (MM-DiT)-based model capable of simultaneously generating dynamic, emotion-controllable, and multi-character dialogue videos. Concretely, HunyuanVideo-Avatar introduces three key innovations: (i) a character image injection module replaces the conventional addition-based character conditioning scheme, eliminating the inherent condition mismatch between training and inference and thereby ensuring dynamic motion and strong character consistency; (ii) an Audio Emotion Module (AEM) extracts the emotional cues from an emotion reference image and transfers them to the target generated video, enabling fine-grained and accurate emotion style control; (iii) a Face-Aware Audio Adapter (FAA) isolates the audio-driven character with a latent-level face mask, enabling independent audio injection via cross-attention in multi-character scenarios. These innovations enable HunyuanVideo-Avatar to surpass state-of-the-art methods on benchmark datasets and on a newly proposed wild dataset, generating realistic avatars in dynamic, immersive scenarios.
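
The idea behind the Face-Aware Audio Adapter, audio injected through cross-attention and gated by a per-character face mask, can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`cross_attention`, `inject_audio`), the identity query/key/value projections, the additive residual update, and the binary per-token masks are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head cross-attention with identity projections (an assumption).

    queries: list of N latent tokens, each a list of d floats.
    keys_values: list of M audio tokens, used as both keys and values.
    Returns N attended vectors of dimension d.
    """
    d = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(d)]
        )
    return attended


def inject_audio(latents, characters):
    """Add each character's audio features only where that character's mask is 1.

    characters: list of (audio_tokens, face_mask) pairs (hypothetical layout);
    face_mask holds one 0/1 value per latent token.
    """
    updated = [list(token) for token in latents]
    for audio_tokens, face_mask in characters:
        attended = cross_attention(latents, audio_tokens)
        for i, gate in enumerate(face_mask):
            for j in range(len(updated[i])):
                # Tokens outside this character's face region are left untouched.
                updated[i][j] += gate * attended[i][j]
    return updated


# Two characters: tokens 0-1 belong to the first face, tokens 2-3 to the second.
latents = [[0.0, 0.0] for _ in range(4)]
speaker_a = ([[1.0, 0.0]], [1, 1, 0, 0])
speaker_b = ([[0.0, 1.0]], [0, 0, 1, 1])
result = inject_audio(latents, [speaker_a, speaker_b])
```

In this toy setup each audio stream only modifies the latent tokens under its own mask, which is the property that lets several characters be driven by separate audio tracks in one frame.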

Yi Chen, Sen Liang, Zixiang Zhou, Ziyao Huang, Yifeng Ma, Junshu Tang, Qin Lin, Yuan Zhou, Qinglin Lu • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Text-to-Video Generation | VBench | -- | 111 |
| Talking Head Generation | HDTF (test) | FVD 322.8 | 33 |
| Image-to-Video lip-syncing | TalkVid | FID 57.05 | 12 |
| Talking avatar video generation | EMTD (test) | FID 63.09 | 10 |
| Talking avatar video generation | Short dataset real avatar images, 5s audio 1.0 | FID 76.49 | 10 |
| Talking avatar video generation | Long dataset 25 synthesized avatar images, 20s audio clips 1.0 | ASE 4.8 | 10 |
| Talking head video generation | Action Bench (test) | Sync-C 6.251 | 9 |
| Audio-driven video generation | EMTD (test) | FID 18.07 | 6 |
| Audio-driven Digital Human Generation | Short Sequence | Sync-C 6.12 | 6 |
| Human-Object Interaction Video Generation | GroundedInter Audio-Driven 1.0 (test) | VLM-QA 0.2688 | 5 |
Showing 10 of 11 rows
