Stable Audio Open

About

Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and cannot be accessed by artists and researchers who want to build upon them. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state of the art across various metrics. Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1 kHz.

Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons · 2024
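
The FDopenl3 metric mentioned in the abstract is a Fréchet distance computed between embeddings of reference and generated audio (OpenL3 embeddings, as the name suggests). The sketch below shows only the distance computation on two arrays of embeddings; it is not the authors' evaluation code. The function name `frechet_distance` is hypothetical, and the random arrays stand in for real embeddings, which would have to be extracted separately.

```python
import numpy as np


def frechet_distance(emb_ref: np.ndarray, emb_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets.

    Each input has shape (num_clips, embedding_dim).
    """
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)

    # trace(sqrt(cov_r @ cov_g)) equals the sum of the square roots of the
    # eigenvalues of the product, which are real and non-negative for
    # covariance matrices; clip to guard against small numerical noise.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)


if __name__ == "__main__":
    # Placeholder data (assumption): random vectors instead of audio embeddings.
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(500, 8))
    generated = rng.normal(loc=0.5, size=(500, 8))
    print(frechet_distance(reference, generated))
```

Lower values mean the two embedding distributions are closer. Identical sets give a distance of approximately zero, and a pure shift of the mean contributes the squared length of that shift.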

Related benchmarks

Task	Dataset	Metric	Result	
Text-to-Audio Generation	AudioCaps (test)	KL Divergence	2.14	195
Sound effects generation	Sound Effects (test)	FAD	0.364	22
Text-to-Music Generation	ATTM Grand Challenge Prompts 1.0 (test)	FAD	0.574	14
Text-to-Music Generation	100 Official Final Prompts (test)	ms-CLAP Score	0.507	13
Audio Super-Resolution	VCTK 24 kHz (test)	LSD	0.831	11
Text-to-Audio Generation	AudioCaps (evaluation)	FAD	4.05	11
Text-conditioned music generation	Song Describer Dataset	FD	96.51	11
Audio Reconstruction	Song Describer	L/R Mel	0.6863	10
Text-to-Audio	AudioCaps	FD (OpenL3)	2.36	10
Context Length Estimation	Song Describer	Context Length (s)	106	10

Showing 10 of 39 rows
