HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation
About
Customized video generation aims to produce videos featuring specific subjects under flexible user-defined conditions, yet existing methods often struggle with identity consistency and limited input modalities. In this paper, we propose HunyuanCustom, a multi-modal customized video generation framework that emphasizes subject consistency while supporting image, audio, video, and text conditions. Built upon HunyuanVideo, our model first addresses the image-text conditioned generation task by introducing a text-image fusion module based on LLaVA for enhanced multi-modal understanding, along with an image ID enhancement module that leverages temporal concatenation to reinforce identity features across frames. To enable audio- and video-conditioned generation, we further propose modality-specific condition injection mechanisms: an AudioNet module that achieves hierarchical alignment via spatial cross-attention, and a video-driven injection module that integrates latent-compressed conditional video through a patchify-based feature-alignment network. Extensive experiments on single- and multi-subject scenarios demonstrate that HunyuanCustom significantly outperforms state-of-the-art open- and closed-source methods in terms of ID consistency, realism, and text-video alignment. Moreover, we validate its robustness across downstream tasks, including audio- and video-driven customized video generation. Our results highlight the effectiveness of multi-modal conditioning and identity-preserving strategies in advancing controllable video generation. All the code and models are available at https://hunyuancustom.github.io.
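As a rough illustration of the temporal concatenation idea mentioned in the abstract, the sketch below treats the identity image latent as one extra frame placed in front of the video latents, so that every video frame shares a sequence with the identity features. This is a minimal sketch under stated assumptions, not the HunyuanCustom implementation: the function names, the flattened per-frame layout, and the choice to prepend the identity frame and drop it afterwards are all hypothetical.

```python
from typing import List

# Hypothetical layout: each latent frame is a flat list of floats.
Frame = List[float]


def concat_identity_temporally(video_latents: List[Frame],
                               identity_latent: Frame) -> List[Frame]:
    """Prepend the identity latent as an extra frame on the time axis.

    Assumption: the identity image has already been encoded into a latent
    with the same per-frame size as the video latents.
    """
    if video_latents and len(video_latents[0]) != len(identity_latent):
        raise ValueError("identity latent must match the per-frame latent size")
    # Copy the frames so the caller's data is never mutated.
    return [list(identity_latent)] + [list(frame) for frame in video_latents]


def strip_identity_frame(sequence: List[Frame]) -> List[Frame]:
    """Drop the identity slot again (assumed to be index 0) before decoding."""
    return sequence[1:]


video = [[1.0, 2.0], [3.0, 4.0]]   # two toy video frames
identity = [9.0, 9.0]              # one toy identity frame
sequence = concat_identity_temporally(video, identity)
```

Here `sequence` holds three frames: the identity frame followed by the two video frames. A model attending over this sequence can read identity features from any time step, and `strip_identity_frame(sequence)` recovers the original video-length sequence.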
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Single-ID Video Generation | Single-ID (evaluation) | ID-Sim | 59.2 | 13 |
| Video Customization | 70-example benchmark 1.0 (test) | FaceSim Arc | 0.53 | 9 |
| Identity-consistent video generation | User Study 15 identities | Face Similarity Score | 3.197 | 8 |
| Audio-visual generation | R2AV 1.0 (test) | AES | 0.589 | 7 |
| Character Replacement | Synthesized benchmark | SSIM | 0.644 | 4 |
| Video Character Replacement | VBench real-world | Subject Consistency | 90.03 | 4 |
| Audio-driven animation | IDBench-Omni RA2V 1.0 | AES | 0.567 | 3 |
| Controlled Video Editing | RV2AV IDBench-Omni 1.0 (test) | AES | 0.538 | 3 |