AudioX: A Unified Framework for Anything-to-Audio Generation
About
Audio and music generation based on flexible multimodal control signals is a widely applicable topic with two key challenges: 1) a unified multimodal modeling framework, and 2) large-scale, high-quality training data. To address these, we propose AudioX, a unified framework for anything-to-audio generation that integrates varied multimodal conditions (i.e., text, video, and audio signals). The core design of this framework is a Multimodal Adaptive Fusion module, which enables the effective fusion of diverse multimodal inputs, enhancing cross-modal alignment and improving overall generation quality. To train this unified model, we construct a large-scale, high-quality dataset, IF-caps, comprising over 7 million samples curated through a structured data annotation pipeline. This dataset provides comprehensive supervision for multimodal-conditioned audio generation. We benchmark AudioX against state-of-the-art methods across a wide range of tasks, finding that our model achieves superior performance, especially in text-to-audio and text-to-music generation. These results demonstrate that our method is capable of audio generation under multimodal control signals, showing powerful instruction-following potential. The code and datasets will be available at https://zeyuet.github.io/AudioX/.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-Audio Generation | AudioCaps (test) | FAD | 3.02 | 138 |
| Text-to-Audio | VGGSound-Omni (test) | KL Divergence | 1.59 | 10 |
| Text-to-Audio Generation | One-to-one evaluation benchmarks Text-to-Audio | FAD | 3.09 | 6 |
| Text-to-Audio Generation | evaluation benchmarks one-to-one | CLAP Score | 29.29 | 6 |
| Text-to-Audio Generation | VGGSound | CLAP Score (Overall) | 33.93 | 5 |
| Video-and-Text-to-Audio Generation | Kling-Audio Eval | KL Divergence | 2.39 | 5 |
| Video-to-Audio | VGGSound-Omni (test) | KL Divergence | 2.96 | 5 |
| Text-to-Audio Generation | Kling-Audio Eval | KL Divergence | 2.73 | 3 |
| Video-to-Audio Generation | Kling-Audio Eval | KL Divergence | 3.13 | 3 |