AudioX: A Unified Framework for Anything-to-Audio Generation

About

Audio and music generation from flexible multimodal control signals is widely applicable, but it faces two key challenges: 1) the lack of a unified multimodal modeling framework, and 2) the scarcity of large-scale, high-quality training data. To address these, we propose AudioX, a unified framework for anything-to-audio generation that integrates varied multimodal conditions (i.e., text, video, and audio signals). The core of the framework is a Multimodal Adaptive Fusion module, which fuses diverse multimodal inputs effectively, enhancing cross-modal alignment and improving overall generation quality. To train this unified model, we construct a large-scale, high-quality dataset, IF-caps, comprising over 7 million samples curated through a structured data annotation pipeline. This dataset provides comprehensive supervision for multimodal-conditioned audio generation. We benchmark AudioX against state-of-the-art methods across a wide range of tasks and find that our model achieves superior performance, especially in text-to-audio and text-to-music generation. These results demonstrate that our method can generate audio under multimodal control signals and shows strong instruction-following potential. The code and datasets will be available at https://zeyuet.github.io/AudioX/.

Zeyue Tian, Zhaoyang Liu, Yizhu Jin, Ruibin Yuan, Liumeng Xue, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo • 2025

Related benchmarks

Task	Dataset	Result
Text-to-Audio Generation	AudioCaps (test)	KL Divergence: 1.37	213
Video-to-Audio	VGGSound (test)	--	25
Text-to-Audio Instruction Following	T2ABench	Count Accuracy (Cnt-acc): 12.4	18
Text-to-Audio Instruction Following	AudioTime	Ordering Accuracy: 34	18
Music Generation	MusicCaps (test)	FAD: 1.42	16
Text-to-Audio Generation	VGGSound	Fréchet Audio Distance (FAD): 4.44	14
Text-to-Music	Downstream Audio Generation (TTM)	CLAP Score: 0.386	12
Video-to-Music Generation	V2M-bench (test)	Fréchet Audio Distance (FAD): 2.12	12
Text-to-Audio	VGGSound-Omni (test)	KL Divergence: 1.59	10
Text-to-Audio Generation	VGGSound (test)	KL Divergence: 1.29	10

Showing 10 of 23 rows
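
Several rows above report KL Divergence. As a rough illustration of what that metric measures, the sketch below computes KL(p || q) between two discrete probability distributions. It is not AudioX's evaluation code. In audio-generation benchmarks such distributions are commonly the class probabilities an audio classifier assigns to reference and generated clips, but the function name `kl_divergence`, the example probabilities, and the direction of the divergence are assumptions made here for illustration only.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Return KL(p || q) for two discrete distributions of equal length.

    Hypothetical helper for illustration; inputs are normalised to sum to 1.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    p_sum, q_sum = sum(p), sum(q)
    total = 0.0
    for pi, qi in zip(p, q):
        pi, qi = pi / p_sum, qi / q_sum
        if pi > 0:
            # eps guards against division by zero when q assigns no mass
            total += pi * math.log(pi / max(qi, eps))
    return total


# Made-up class probabilities (assumption): reference clip vs. generated clip
reference_probs = [0.7, 0.2, 0.1]
generated_probs = [0.5, 0.3, 0.2]
score = kl_divergence(reference_probs, generated_probs)
print(f"KL divergence: {score:.4f}")
```

Lower values mean the two distributions are closer, and identical distributions give 0.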
