
Long-form music generation with latent diffusion

About

Audio-based generative models for music have seen great strides recently, but so far have not managed to produce full-length music tracks with coherent musical structure from text prompts. We show that by training a generative model on long temporal contexts it is possible to produce long-form music of up to 4m45s. Our model consists of a diffusion-transformer operating on a highly downsampled continuous latent representation (latent rate of 21.5Hz). It obtains state-of-the-art generations according to metrics on audio quality and prompt alignment, and subjective tests reveal that it produces full-length music with coherent structure.
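
For scale, the abstract's two numbers imply the sequence length the diffusion-transformer must handle. A minimal back-of-the-envelope sketch in Python (the constants come from the abstract; everything else is illustrative, not the paper's code):

```python
# Back-of-the-envelope: how long is the latent sequence for a full track?
# Constants are taken from the abstract; this is an illustration, not paper code.
LATENT_RATE_HZ = 21.5          # latent rate of the continuous representation
TRACK_SECONDS = 4 * 60 + 45    # 4m45s, the longest generation reported

num_latent_frames = round(TRACK_SECONDS * LATENT_RATE_HZ)
print(f"{TRACK_SECONDS} s at {LATENT_RATE_HZ} Hz -> ~{num_latent_frames} latent frames")
# -> 285 s at 21.5 Hz -> ~6128 latent frames
```

So the heavy downsampling keeps a 4m45s track to roughly 6,000 latent frames, a temporal context a transformer can attend over directly.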

Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons • 2024

Related benchmarks

Task                     | Dataset                                          | Result                  | Rank
Sound Morphing           | 100 curated infusion prompts 1.0 (test)          | LCS: 0.136              | 13
Text-to-Music Generation | Song Describer Dataset (full)                    | FD_openl3: 72.17        | 5
Music Generation         | Song Describer Dataset (test)                    | FD_openl3: 81.05        | 5
Music Generation         | Song Describer Dataset no-singing (test)         | FD_openl3: 79.09        | 4
Music Generation         | Song Describer Dataset no-singing                | FD_openl3: 78.7         | 4
Music Generation         | Song Describer Dataset no-singing 2m             | Stereo Correctness: 96  | 3
Music Generation         | Song Describer Dataset no-singing subset 4m 45s  | Stereo Correctness: 100 | 2
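For context on the table above: FD_openl3 is a Fréchet Distance computed between OpenL3 embeddings of generated and reference audio (lower is better). Below is a minimal NumPy/SciPy sketch of the Fréchet Distance itself, assuming the OpenL3 embeddings have already been extracted; the embedding step and the `frechet_distance` helper name are assumptions for illustration, not the benchmark's actual implementation:

```python
import numpy as np
from scipy import linalg

def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Fréchet Distance between two embedding sets of shape
    (num_samples, embedding_dim), under a Gaussian assumption:
    ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^(1/2))."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # Matrix square root of the covariance product; keep the real part,
    # since sqrtm can return tiny imaginary components from rounding.
    covmean = linalg.sqrtm(cov_a @ cov_b).real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

# Usage (hypothetical variable names, embeddings from an OpenL3 model):
# fd = frechet_distance(openl3_generated, openl3_reference)
```

LCS and Stereo Correctness are separate benchmark-specific metrics and are not covered by this sketch.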

Other info

Code
