AudioX: A Unified Framework for Anything-to-Audio Generation

About

Audio and music generation from flexible multimodal control signals is widely applicable, but it poses two key challenges: 1) building a unified multimodal modeling framework, and 2) obtaining large-scale, high-quality training data. In this work, we propose AudioX, a unified framework for anything-to-audio generation that integrates varied multimodal conditions (i.e., text, video, and audio signals). The core design of this framework is a Multimodal Adaptive Fusion module, which enables the effective fusion of diverse multimodal inputs, enhancing cross-modal alignment and improving overall generation quality. To train this unified model, we construct a large-scale, high-quality dataset, IF-caps, comprising over 7 million samples curated through a structured data annotation pipeline. This dataset provides comprehensive supervision for multimodal-conditioned audio generation. We benchmark AudioX against state-of-the-art methods across a wide range of tasks and find that our model achieves superior performance, especially in text-to-audio and text-to-music generation. These results demonstrate that our method can generate audio under multimodal control signals and shows strong instruction-following potential. The code and datasets will be available at https://zeyuet.github.io/AudioX/.

Zeyue Tian, Zhaoyang Liu, Yizhu Jin, Ruibin Yuan, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Text-to-Audio Generation | AudioCaps (test) | FAD | 3.02 | 138 |
| Text-to-Audio | VGGSound-Omni (test) | KL Divergence | 1.59 | 10 |
| Text-to-Audio Generation | One-to-one evaluation benchmarks Text-to-Audio | FAD | 3.09 | 6 |
| Text-to-Audio Generation | evaluation benchmarks one-to-one | CLAP Score | 29.29 | 6 |
| Text-to-Audio Generation | VGGSound | CLAP Score (Overall) | 33.93 | 5 |
| Video-and-Text-to-Audio Generation | Kling-Audio Eval | KL Divergence | 2.39 | 5 |
| Video-to-Audio | VGGSound-Omni (test) | KL Divergence | 2.96 | 5 |
| Text-to-Audio Generation | Kling-Audio Eval | KL Divergence | 2.73 | 3 |
| Video-to-Audio Generation | Kling-Audio Eval | KL Divergence | 3.13 | 3 |
