
HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation

About

Recent advances in video generation produce visually realistic content, yet the absence of synchronized audio severely compromises immersion. To address key challenges in video-to-audio generation, including multimodal data scarcity, modality imbalance and limited audio quality in existing methods, we propose HunyuanVideo-Foley, an end-to-end text-video-to-audio framework that synthesizes high-fidelity audio precisely aligned with visual dynamics and semantic context. Our approach incorporates three core innovations: (1) a scalable data pipeline curating 100k-hour multimodal datasets through automated annotation; (2) a representation alignment strategy using self-supervised audio features to guide latent diffusion training, efficiently improving audio quality and generation stability; (3) a novel multimodal diffusion transformer resolving modal competition, containing dual-stream audio-video fusion through joint attention, and textual semantic injection via cross-attention. Comprehensive evaluations demonstrate that HunyuanVideo-Foley achieves new state-of-the-art performance across audio fidelity, visual-semantic alignment, temporal alignment and distribution matching. The demo page is available at: https://szczesnys.github.io/hunyuanvideo-foley/.
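To illustrate the representation alignment idea in (2), here is a minimal sketch of one common way to write such an objective: a negative cosine similarity between per-frame features taken from the diffusion model and per-frame features from a frozen self-supervised audio encoder. The abstract does not state the loss form, so the cosine objective, the function names `cosine_similarity` and `alignment_loss`, and the assumption that both feature sequences already share the same length and dimension (that is, any projection layer has already been applied) are all illustrative choices, not the authors' implementation.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero vectors
    return dot / (norm_u * norm_v + eps)


def alignment_loss(hidden_frames, target_frames):
    """Hypothetical alignment loss: negative mean per-frame cosine similarity.

    hidden_frames: per-frame features from the diffusion model
                   (assumed already projected to the target dimension).
    target_frames: per-frame features from a frozen self-supervised
                   audio encoder (assumed, not specified in the abstract).
    """
    if len(hidden_frames) != len(target_frames):
        raise ValueError("feature sequences must have the same number of frames")
    sims = [cosine_similarity(h, t) for h, t in zip(hidden_frames, target_frames)]
    return -sum(sims) / len(sims)
```

In a training loop under these assumptions, a term like this would be added to the diffusion loss with a weighting coefficient, so that intermediate features are pulled toward the self-supervised representation while the main denoising objective is unchanged.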

Sizhe Shan, Qiulin Li, Yutao Cui, Miles Yang, Yuehai Wang, Qun Yang, Jin Zhou, Zhao Zhong • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video-to-Audio Generation | VGGSound (test) | -- | 62 |
| Text-to-Audio | VGGSound-Omni (test) | KL Divergence: 1.74 | 10 |
| Video-to-Audio Generation | EchoFoley 6k | Temporal Control Score: 43 | 9 |
| Video-to-Audio Generation | UnAV100 | FD (VGG): 4.89 | 8 |
| Video-to-Audio Generation | LongVale | FD (VGG): 14.56 | 8 |
| Video-to-Audio Generation | Kling-Eval (test) | FDPaSST: 202.1 | 7 |
| Video-and-Text-to-Audio Generation | Kling-Audio Eval | KL Divergence: 2.13 | 5 |
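For readers unfamiliar with the KL Divergence results listed above, the following is a minimal sketch of the underlying quantity, the Kullback-Leibler divergence between two discrete probability distributions. In audio generation benchmarks these distributions are typically class probabilities that a pretrained audio classifier assigns to the reference and generated clips, but that pairing is an assumption here, since this page does not describe the evaluation protocol. The function name `kl_divergence` and the smoothing constant `eps` are illustrative.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions given as lists.

    p: reference distribution (e.g. classifier output on ground-truth audio).
    q: compared distribution (e.g. classifier output on generated audio).
    eps: small constant to avoid log(0); an illustrative choice.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of classes")
    # Terms with p_i == 0 contribute nothing to the sum by convention
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )
```

Lower values mean the two distributions are closer; identical distributions give a divergence of zero.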
