HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters
About
Recent years have witnessed significant progress in audio-driven human animation. However, critical challenges remain in (i) generating highly dynamic videos while preserving character consistency, (ii) achieving precise emotion alignment between characters and audio, and (iii) enabling multi-character audio-driven animation. To address these challenges, we propose HunyuanVideo-Avatar, a multimodal diffusion transformer (MM-DiT)-based model capable of simultaneously generating dynamic, emotion-controllable, multi-character dialogue videos. Concretely, HunyuanVideo-Avatar introduces three key innovations: (i) a character image injection module that replaces the conventional addition-based character conditioning scheme, eliminating the inherent condition mismatch between training and inference and thereby ensuring dynamic motion together with strong character consistency; (ii) an Audio Emotion Module (AEM) that extracts emotional cues from an emotion reference image and transfers them to the generated video, enabling fine-grained and accurate emotion style control; and (iii) a Face-Aware Audio Adapter (FAA) that isolates each audio-driven character with a latent-level face mask, enabling independent audio injection via cross-attention in multi-character scenarios. These innovations empower HunyuanVideo-Avatar to surpass state-of-the-art methods on benchmark datasets and a newly proposed wild dataset, generating realistic avatars in dynamic, immersive scenarios.
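The face-masked cross-attention idea behind the FAA can be sketched in a few lines of PyTorch. This is a minimal illustration, not the released implementation: the class name, tensor layout, and the simple residual gating are all assumptions; it only shows how a latent-level face mask can restrict audio injection to one character's tokens so that several audio streams can drive several characters independently.

```python
import torch
import torch.nn as nn

class FaceAwareAudioAdapter(nn.Module):
    """Hypothetical sketch of face-masked audio injection.

    Video latent tokens attend to audio tokens via cross-attention,
    and the result is added back only inside a binary face mask, so
    in multi-character scenes each audio stream drives only the
    tokens of its own character.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, latent: torch.Tensor, audio: torch.Tensor,
                face_mask: torch.Tensor) -> torch.Tensor:
        # latent:    (B, N, D) video latent tokens
        # audio:     (B, M, D) audio embedding tokens (e.g. from a speech encoder)
        # face_mask: (B, N, 1) values in {0, 1} marking the target character's face tokens
        out, _ = self.attn(latent, audio, audio)  # latents query the audio tokens
        return latent + face_mask * out           # masked residual injection
```

For two characters, the adapter would simply be applied once per (audio stream, face mask) pair; tokens outside a character's mask pass through unchanged, which is what keeps the injections independent.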
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Video Generation | VBench | -- | -- | 111 |
| Talking Head Generation | HDTF (test) | FVD | 322.8 | 33 |
| Image-to-Video Lip-Syncing | TalkVid | FID | 57.05 | 12 |
| Talking Avatar Video Generation | EMTD (test) | FID | 63.09 | 10 |
| Talking Avatar Video Generation | Short dataset (real avatar images, 5 s audio) 1.0 | FID | 76.49 | 10 |
| Talking Avatar Video Generation | Long dataset (25 synthesized avatar images, 20 s audio clips) 1.0 | ASE | 4.8 | 10 |
| Talking Head Video Generation | Action Bench (test) | Sync-C | 6.251 | 9 |
| Audio-Driven Video Generation | EMTD (test) | FID | 18.07 | 6 |
| Audio-Driven Digital Human Generation | Short Sequence | Sync-C | 6.12 | 6 |
| Human-Object Interaction Video Generation | GroundedInter Audio-Driven 1.0 (test) | VLM-QA | 0.2688 | 5 |