
Stable Audio Open

About

Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FD_openl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1 kHz.

Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons • 2024

Related benchmarks

Task | Dataset | Metric | Result | Rank
Text-to-Audio Generation | AudioCaps (test) | KL Divergence | 2.14 | 195
Sound effects generation | Sound Effects (test) | FAD | 0.364 | 22
Text-to-Music Generation | ATTM Grand Challenge Prompts 1.0 (test) | FAD | 0.574 | 14
Text-to-Music Generation | 100 Official Final Prompts (test) | ms-CLAP Score | 0.507 | 13
Audio Super-Resolution | VCTK 24 kHz (test) | LSD | 0.831 | 11
Text-to-Audio Generation | AudioCaps (evaluation) | FAD | 4.05 | 11
Text-conditioned music generation | Song Describer Dataset | FD | 96.51 | 11
Audio Reconstruction | Song Describer | L/R Mel | 0.6863 | 10
Text-to-Audio | AudioCaps | FD (OpenL3) | 2.36 | 10
Context Length Estimation | Song Describer | Context Length (s) | 106 | 10
Showing 10 of 39 rows
