
HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation

About

Customized video generation aims to produce videos featuring specific subjects under flexible user-defined conditions, yet existing methods often struggle with identity consistency and limited input modalities. In this paper, we propose HunyuanCustom, a multi-modal customized video generation framework that emphasizes subject consistency while supporting image, audio, video, and text conditions. Built upon HunyuanVideo, our model first addresses the image-text conditioned generation task by introducing a text-image fusion module based on LLaVA for enhanced multi-modal understanding, along with an image ID enhancement module that leverages temporal concatenation to reinforce identity features across frames. To enable audio- and video-conditioned generation, we further propose modality-specific condition injection mechanisms: an AudioNet module that achieves hierarchical alignment via spatial cross-attention, and a video-driven injection module that integrates latent-compressed conditional video through a patchify-based feature-alignment network. Extensive experiments on single- and multi-subject scenarios demonstrate that HunyuanCustom significantly outperforms state-of-the-art open- and closed-source methods in terms of ID consistency, realism, and text-video alignment. Moreover, we validate its robustness across downstream tasks, including audio- and video-driven customized video generation. Our results highlight the effectiveness of multi-modal conditioning and identity-preserving strategies in advancing controllable video generation. All the code and models are available at https://hunyuancustom.github.io.
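As a rough illustration of the "temporal concatenation" idea mentioned in the abstract, the sketch below prepends a reference-image latent to a sequence of video-frame latents along the time axis, so that every later frame can attend to the identity frame. This page does not show the paper's implementation: the function name `concat_identity_along_time`, the nested-list latent representation, and the choice to place the identity frame first are all assumptions made for illustration.

```python
# Sketch only: names, shapes, and frame placement are assumptions,
# not HunyuanCustom's actual code.
from typing import List

Frame = List[List[float]]  # one latent frame as an H x W grid (channels omitted)


def concat_identity_along_time(image_latent: Frame,
                               video_latents: List[Frame]) -> List[Frame]:
    """Prepend the reference-image latent as an extra frame on the time axis."""
    if not video_latents:
        raise ValueError("video_latents must contain at least one frame")
    height = len(video_latents[0])
    width = len(video_latents[0][0])
    # The identity frame must share the spatial size of the video frames.
    if len(image_latent) != height or any(len(row) != width for row in image_latent):
        raise ValueError("image latent must match the spatial size of the video latents")
    return [image_latent] + list(video_latents)


# Toy example: a 4-frame video latent of size 2 x 3, plus one identity frame.
video = [[[float(t)] * 3 for _ in range(2)] for t in range(1, 5)]
identity = [[0.0] * 3 for _ in range(2)]
sequence = concat_identity_along_time(identity, video)
print(len(sequence))  # 5 frames: 1 identity frame + 4 video frames
```

In a real model the latents would be tensors with channel and batch dimensions, and the concatenated sequence would be passed to the video backbone; the list-based version here only shows the ordering along the time axis.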

Teng Hu, Zhentao Yu, Zhengguang Zhou, Sen Liang, Yuan Zhou, Qin Lin, Qinglin Lu • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Identity-Preserving Video Generation | OpenS2V (test) | Face Similarity | 0.622 | 17 |
| Single-ID Video Generation | Single-ID (evaluation) | ID-Sim | 59.2 | 13 |
| Video Customization | 70-example benchmark 1.0 (test) | FaceSim Arc | 0.53 | 9 |
| Video Re-Synthesis | DAVIS (test) | PSNR | 21.5 | 8 |
| Identity-consistent video generation | User Study 15 identities | Face Similarity Score | 3.197 | 8 |
| Video Re-Synthesis | TRACE manually curated (test) | PSNR | 780.1 | 8 |
| Joint audio-video generation | Identity-aware T2AV (test) | AES | 0.553 | 7 |
| Text+Reference-to-Video (R2V) Generation | HOIVG-Bench 1.0 (test) | TA | 7.523 | 7 |
| Audio-visual generation | R2AV 1.0 (test) | AES | 0.589 | 7 |
| Identity-Preserving Video Generation | User Study | Face Similarity Score | 3.34 | 6 |
Showing 10 of 17 rows
