HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation

About

Recent advances in video generation produce visually realistic content, yet the absence of synchronized audio severely compromises immersion. To address key challenges in video-to-audio generation, including multimodal data scarcity, modality imbalance and limited audio quality in existing methods, we propose HunyuanVideo-Foley, an end-to-end text-video-to-audio framework that synthesizes high-fidelity audio precisely aligned with visual dynamics and semantic context. Our approach incorporates three core innovations: (1) a scalable data pipeline curating 100k-hour multimodal datasets through automated annotation; (2) a representation alignment strategy using self-supervised audio features to guide latent diffusion training, efficiently improving audio quality and generation stability; (3) a novel multimodal diffusion transformer resolving modal competition, containing dual-stream audio-video fusion through joint attention, and textual semantic injection via cross-attention. Comprehensive evaluations demonstrate that HunyuanVideo-Foley achieves new state-of-the-art performance across audio fidelity, visual-semantic alignment, temporal alignment and distribution matching. The demo page is available at: https://szczesnys.github.io/hunyuanvideo-foley/.
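The representation alignment idea in innovation (2) can be illustrated with a small sketch. The abstract does not give the loss, so everything below is an assumption: it uses negative mean cosine similarity between projected diffusion hidden states and frozen self-supervised audio features, a common form for this kind of alignment. The names `alignment_loss`, `total_loss`, and `weight` are hypothetical and are not taken from the paper or its code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def alignment_loss(projected_hidden, target_features):
    """Negative mean per-frame cosine similarity (assumed form of the loss).

    projected_hidden: per-frame hidden states of the diffusion model, already
        projected to the dimension of the audio features (hypothetical input).
    target_features: per-frame features from a frozen self-supervised audio
        encoder (hypothetical input).
    """
    if not projected_hidden or len(projected_hidden) != len(target_features):
        raise ValueError("inputs must be non-empty and have the same number of frames")
    sims = [cosine_similarity(h, t) for h, t in zip(projected_hidden, target_features)]
    return -sum(sims) / len(sims)


def total_loss(diffusion_loss, projected_hidden, target_features, weight=0.5):
    """Diffusion objective plus a weighted alignment term; `weight` is illustrative."""
    return diffusion_loss + weight * alignment_loss(projected_hidden, target_features)


if __name__ == "__main__":
    hidden = [[1.0, 0.0], [0.0, 2.0]]
    target = [[2.0, 0.0], [0.0, 1.0]]
    print(alignment_loss(hidden, target))
    print(total_loss(1.0, hidden, target))
```

Under this assumed formulation, the alignment term is lowest when every projected hidden state points in the same direction as its target feature, so minimizing it pulls the diffusion model's intermediate representation toward the audio encoder's feature space.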

Sizhe Shan, Qiulin Li, Yutao Cui, Miles Yang, Yuehai Wang, Qun Yang, Jin Zhou, Zhao Zhong • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Video-to-Audio Generation | VGGSound (test) | -- | 83 |
| Text-to-Audio | VGGSound-Omni (test) | KL Divergence 1.74 | 10 |
| Video-to-Audio Generation | EchoFoley 6k | Temporal Control Score 43 | 9 |
| Video-to-Audio Generation | UnAV100 | FD (VGG) 4.89 | 8 |
| Video-to-Audio Generation | LongVale | FD (VGG) 14.56 | 8 |
| Universal Holistic Audio Generation | UniHAGen-Bench 1.0 (test) | FAD 6 | 7 |
| Video-to-Audio Generation | Kling-Eval (test) | FDPaSST 202.1 | 7 |
| Video-to-Audio Generation | AudioCanvas (out-of-domain) | CLAP 44 | 7 |
| Video-to-Audio Generation | VGGSound-Director (test) | FD (VGG) 2.39 | 6 |
| Controllable Audio Generation | DirectorBench | Counterfactual Precision 22.41 | 5 |
Showing 10 of 13 rows
