HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation

About

Recent advances in video generation produce visually realistic content, yet the absence of synchronized audio severely compromises immersion. To address key challenges in video-to-audio generation, including multimodal data scarcity, modality imbalance and limited audio quality in existing methods, we propose HunyuanVideo-Foley, an end-to-end text-video-to-audio framework that synthesizes high-fidelity audio precisely aligned with visual dynamics and semantic context. Our approach incorporates three core innovations: (1) a scalable data pipeline that curates 100k-hour multimodal datasets through automated annotation; (2) a representation alignment strategy that uses self-supervised audio features to guide latent diffusion training, efficiently improving audio quality and generation stability; and (3) a novel multimodal diffusion transformer that resolves modal competition by combining dual-stream audio-video fusion through joint attention with textual semantic injection via cross-attention. Comprehensive evaluations demonstrate that HunyuanVideo-Foley achieves new state-of-the-art performance across audio fidelity, visual-semantic alignment, temporal alignment and distribution matching. The demo page is available at: https://szczesnys.github.io/hunyuanvideo-foley/.

Sizhe Shan, Qiulin Li, Yutao Cui, Miles Yang, Yuehai Wang, Qun Yang, Jin Zhou, Zhao Zhong • 2025
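
The representation alignment idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss, so this assumes a common choice: the negative mean cosine similarity between the diffusion model's per-frame hidden states and features from a frozen self-supervised audio encoder. The function names are hypothetical, and the learned projection that would normally map hidden states to the encoder's feature dimension is omitted (both inputs are assumed to already share a dimension).

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def alignment_loss(hidden_frames: Sequence[Vector],
                   target_frames: Sequence[Vector]) -> float:
    """Hypothetical alignment loss: negative mean per-frame cosine similarity.

    hidden_frames: per-frame hidden states of the diffusion model
                   (assumed already projected to the target dimension).
    target_frames: per-frame features from a frozen self-supervised
                   audio encoder (assumed, not taken from the paper).
    """
    if not hidden_frames or len(hidden_frames) != len(target_frames):
        raise ValueError("need the same non-zero number of frames in both inputs")
    sims = [cosine_similarity(h, t) for h, t in zip(hidden_frames, target_frames)]
    return -sum(sims) / len(sims)


if __name__ == "__main__":
    # Two frames whose directions match the targets exactly.
    hidden = [[1.0, 0.0], [0.0, 2.0]]
    target = [[2.0, 0.0], [0.0, 1.0]]
    print(alignment_loss(hidden, target))  # → -1.0
```

In a training loop, a term like this would typically be added to the diffusion objective with a weighting coefficient; the weight and the layer whose hidden states are aligned are design choices not given in the abstract.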

Related benchmarks

Task	Dataset	Result
Video-to-Audio Generation	VGGSound (test)	--	95
Video-to-Audio Generation	VGGSound	FD_VGG 2.18	32
Text-to-Audio	VGGSound-Omni (test)	KL Divergence 1.74	10
Video-to-Audio Generation	EchoFoley 6k	Temporal Control Score 43	9
Video-to-Audio Generation	UnAV100	FD (VGG) 4.89	8
Video-to-Audio Generation	LongVale	FD (VGG) 14.56	8
Universal Holistic Audio Generation	UniHAGen-Bench 1.0 (test)	FAD 6	7
Video-to-Audio Generation	Kling-Eval (test)	FDPaSST 202.1	7
Video-to-Audio Generation	AudioCanvas (out-of-domain)	CLAP 44	7
Video-to-Audio Generation	VGGSound-Director (test)	FD (VGG) 2.39	6

Showing 10 of 14 rows
