Stable Audio Open
About
Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FD_openl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1 kHz.
Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons • 2024
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Audio Generation | AudioCaps (test) | FAD | 2.32 | 138 |
| Audio Reconstruction | Song Describer | L/R Mel | 0.6863 | 10 |
| Context Length Estimation | Song Describer | Context Length (s) | 106 | 10 |
| Text-to-Audio | AudioSet Strong | F1 Event | 6.05 | 9 |
| Continuous Audio Compression | 48 kHz Sound Effects | FAD | 0.78 | 7 |
| Text-to-Audio | Text-to-Audio (test) | Loudness MAE | 17.49 | 7 |
| Music Generation | Song Describer Dataset (test) | FD_openl3 | 96.51 | 5 |
| Text-to-Audio | AudioCaps multi-event prompts | FD_openl3 | 88.5 | 5 |
| Text-to-Music Generation | Song Describer Dataset (full) | FD_openl3 | 99.7 | 5 |
| Text-to-Audio Generation | Human Evaluation Subjective Audio Assessment (test) | Z-Score (OVL) | 0.0723 | 4 |