Long-form music generation with latent diffusion
About
Audio-based generative models for music have made great strides recently, but so far have not managed to produce full-length music tracks with coherent musical structure from text prompts. We show that by training a generative model on long temporal contexts it is possible to produce long-form music of up to 4m45s. Our model consists of a diffusion-transformer operating on a highly downsampled continuous latent representation (latent rate of 21.5 Hz). It obtains state-of-the-art generations according to metrics of audio quality and prompt alignment, and subjective tests reveal that it produces full-length music with coherent structure.
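To make the scale concrete: at a 21.5 Hz latent rate, a 4m45s (285 s) track corresponds to roughly 6,128 latent frames, which is the sequence length the diffusion-transformer operates over. The sketch below illustrates that arithmetic and a generic reverse-diffusion loop; the channel count, step count, and `denoise_step` stand-in are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

# A 21.5 Hz latent rate means each second of audio maps to 21.5 latent frames.
LATENT_RATE_HZ = 21.5
DURATION_S = 4 * 60 + 45      # 4m45s, the paper's maximum generation length
LATENT_CHANNELS = 64          # hypothetical channel count, for illustration only

num_frames = round(LATENT_RATE_HZ * DURATION_S)  # 21.5 * 285 = 6127.5 -> 6128
print(f"{DURATION_S}s of audio -> ~{num_frames} latent frames")

def denoise_step(x, t):
    """Stand-in for the diffusion-transformer: a real model predicts the
    denoised latent conditioned on the text prompt and timing information."""
    return x * (1.0 - 1.0 / t)  # dummy update, illustration only

# Minimal reverse-diffusion loop over the full-length latent sequence.
x = np.random.randn(LATENT_CHANNELS, num_frames)  # start from pure noise
for t in range(50, 0, -1):                        # 50 sampling steps, arbitrary
    x = denoise_step(x, t)
# x would then be decoded back to audio by the autoencoder's decoder.
```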
Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons • 2024
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Sound Morphing | 100 curated infusion prompts 1.0 (test) | LCS | 0.136 | 13 |
| Text-to-Music Generation | Song Describer Dataset (full) | FD_openl3 | 72.17 | 5 |
| Music Generation | Song Describer Dataset (test) | FD_openl3 | 81.05 | 5 |
| Music Generation | Song Describer Dataset no-singing (test) | FD_openl3 | 79.09 | 4 |
| Music Generation | Song Describer Dataset no-singing | FD_openl3 | 78.7 | 4 |
| Music Generation | Song Describer Dataset no-singing 2m | Stereo Correctness | 96 | 3 |
| Music Generation | Song Describer Dataset no-singing subset 4m 45s | Stereo Correctness | 100 | 2 |
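The FD_openl3 entries above are Fréchet Distances between OpenL3 embeddings of generated and reference audio (lower is better). For reference, below is a minimal sketch of the standard Fréchet Distance computation; it assumes the OpenL3 features have already been extracted, and the embedding dimension and toy data are illustrative, not tied to this benchmark's evaluation code.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Fréchet Distance between two embedding sets of shape (num_samples, dim),
    e.g. OpenL3 features of generated vs. reference audio."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)   # matrix square root of the covariance product
    if np.iscomplexobj(covmean):     # discard tiny imaginary parts from numerics
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

# Toy usage with random vectors standing in for OpenL3 embeddings:
rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(200, 512)), rng.normal(size=(200, 512))))
```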