Long-form music generation with latent diffusion
About
Audio-based generative models for music have made great strides recently, but so far have not managed to produce full-length music tracks with coherent musical structure from text prompts. We show that by training a generative model on long temporal contexts it is possible to produce long-form music of up to 4m45s. Our model consists of a diffusion-transformer operating on a highly downsampled continuous latent representation (latent rate of 21.5 Hz). It obtains state-of-the-art generations according to metrics of audio quality and prompt alignment, and subjective tests reveal that it produces full-length music with coherent structure.
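To make the scale concrete: at a 21.5 Hz latent rate, a 4m45s (285 s) track corresponds to roughly 6,128 latent frames, which is the sequence length the diffusion-transformer operates over. The sketch below illustrates that arithmetic and a generic reverse-diffusion loop; the channel count, step count, and `denoise_step` stand-in are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

# A 21.5 Hz latent rate means each second of audio maps to 21.5 latent frames.
LATENT_RATE_HZ = 21.5
DURATION_S = 4 * 60 + 45      # 4m45s, the paper's maximum generation length
LATENT_CHANNELS = 64          # hypothetical channel count, for illustration only

num_frames = round(LATENT_RATE_HZ * DURATION_S)  # 21.5 * 285 = 6127.5 -> 6128
print(f"{DURATION_S}s of audio -> ~{num_frames} latent frames")

def denoise_step(x, t):
    """Stand-in for the diffusion-transformer: a real model predicts the
    denoised latent conditioned on the text prompt and timing information."""
    return x * (1.0 - 1.0 / t)  # dummy update, illustration only

# Minimal reverse-diffusion loop over the full-length latent sequence.
x = np.random.randn(LATENT_CHANNELS, num_frames)  # start from pure noise
for t in range(50, 0, -1):                        # 50 sampling steps, arbitrary
    x = denoise_step(x, t)
# x would then be decoded back to audio by the autoencoder's decoder.
```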
Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons • 2024
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Sound Morphing | 100 curated infusion prompts 1.0 (test) | LCS | 0.136 | 13 |
| Text-to-Music Generation | Song Describer Dataset (full) | FD_openl3 | 72.17 | 5 |
| Music Generation | Song Describer Dataset (test) | FD_openl3 | 81.05 | 5 |
| Music Generation | Song Describer Dataset no-singing (test) | FD_openl3 | 79.09 | 4 |
| Music Generation | Song Describer Dataset no-singing | FD_openl3 | 78.7 | 4 |
| Music Generation | Song Describer Dataset no-singing 2m | Stereo Correctness | 96 | 3 |
| Music Generation | Song Describer Dataset no-singing subset 4m 45s | Stereo Correctness | 100 | 2 |
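The FD_openl3 entries above are Fréchet Distances between OpenL3 embeddings of generated and reference audio (lower is better). For reference, below is a minimal sketch of the standard Fréchet Distance computation; it assumes the OpenL3 features have already been extracted, and the embedding dimension and toy data are illustrative, not tied to this benchmark's evaluation code.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Fréchet Distance between two embedding sets of shape (num_samples, dim),
    e.g. OpenL3 features of generated vs. reference audio."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)   # matrix square root of the covariance product
    if np.iscomplexobj(covmean):     # discard tiny imaginary parts from numerics
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

# Toy usage with random vectors standing in for OpenL3 embeddings:
rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(200, 512)), rng.normal(size=(200, 512))))
```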