Stable Audio Open
About
Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FD_openl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1 kHz.
Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons • 2024
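The abstract cites a Fréchet distance computed on OpenL3 embeddings as its realism metric. As a rough illustration of what such a metric measures, the sketch below computes the Fréchet distance between two sets of embedding vectors by fitting a Gaussian to each set. It is not the authors' evaluation code: the function name `frechet_distance` is hypothetical, and the inputs are assumed to be arrays of precomputed embeddings (extracting OpenL3 embeddings from audio is outside this sketch).

```python
import numpy as np


def frechet_distance(emb_ref: np.ndarray, emb_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two embedding sets.

    Both inputs have shape (num_clips, embedding_dim). The embeddings are
    assumed to be precomputed (for example by an audio embedding model).
    """
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)

    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g; clip tiny negative values caused by
    # floating-point error.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)


# Toy usage with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 4))
generated = reference + 1.0  # same spread, mean shifted by 1 in every dimension
print(frechet_distance(reference, generated))
```

Lower values mean the generated set's embedding statistics sit closer to the reference set's. In the toy usage only the mean differs, so the distance reduces to the squared length of the mean shift.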
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Audio Generation | AudioCaps (test) | KL Divergence | 2.14 | 195 |
| Sound effects generation | Sound Effects (test) | FAD | 0.364 | 22 |
| Text-to-Music Generation | ATTM Grand Challenge Prompts 1.0 (test) | FAD | 0.574 | 14 |
| Text-to-Music Generation | 100 Official Final Prompts (test) | ms-CLAP Score | 0.507 | 13 |
| Audio Super-Resolution | VCTK 24 kHz (test) | LSD | 0.831 | 11 |
| Text-to-Audio Generation | AudioCaps (evaluation) | FAD | 4.05 | 11 |
| Text-conditioned music generation | Song Describer Dataset | FD | 96.51 | 11 |
| Audio Reconstruction | Song Describer | L/R Mel | 0.6863 | 10 |
| Text-to-Audio | AudioCaps | FD (OpenL3) | 2.36 | 10 |
| Context Length Estimation | Song Describer | Context Length (s) | 106 | 10 |
Showing 10 of 39 rows