FLUX that Plays Music

About

This paper explores a simple extension of diffusion-based rectified flow Transformers for text-to-music generation, termed FluxMusic. Following the design of the advanced Flux\footnote{https://github.com/black-forest-labs/flux} model, we transfer it into a latent VAE space of the mel-spectrogram. The model first applies a sequence of independent attention layers to the double text-music stream, followed by a stacked single music stream for denoised patch prediction. We employ multiple pre-trained text encoders to sufficiently capture caption semantics and to allow inference flexibility. Coarse textual information, together with time-step embeddings, is used in a modulation mechanism, while fine-grained textual details are concatenated with the music patch sequence as input. Through an in-depth study, we demonstrate that rectified flow training with an optimized architecture significantly outperforms established diffusion methods on the text-to-music task, as evidenced by both automatic metrics and human preference evaluations. Our experimental data, code, and model weights are publicly available at: \url{https://github.com/feizc/FluxMusic}.
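The data flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: all dimensions, weights, and helper names are hypothetical, and the attention layers themselves are elided. It shows only how the conditioning signal (time-step embedding plus pooled coarse text) modulates the two streams, and how the music patches continue alone into the single stream.

```python
import numpy as np

rng = np.random.default_rng(0)

def modulate(x, cond, w):
    # AdaLN-style modulation: a scale and shift derived from the
    # conditioning vector (time-step embedding + pooled coarse text).
    scale, shift = np.split(cond @ w, 2, axis=-1)
    return x * (1 + scale) + shift

d = 64                    # hidden size (hypothetical)
n_text, n_music = 8, 16   # token counts (hypothetical)

text = rng.normal(size=(n_text, d))    # fine-grained text tokens
music = rng.normal(size=(n_music, d))  # noisy mel-spectrogram patches
cond = rng.normal(size=(d,))           # time-step + pooled-text embedding
w_mod = rng.normal(size=(d, 2 * d)) * 0.01

# Double stream: text and music are modulated separately, then joined
# into one sequence over which joint attention would run (elided here).
joint = np.concatenate([modulate(text, cond, w_mod),
                        modulate(music, cond, w_mod)], axis=0)

# Single stream: only the music patches continue on, to predict the
# denoised patches for rectified-flow training.
music_out = joint[n_text:]
print(music_out.shape)  # (16, 64)
```

The key design choice mirrored here is that text tokens condition the music stream only through joint attention and modulation; after the double-stream blocks, the text tokens are discarded and the remaining single-stream blocks operate on music patches alone.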

Zhengcong Fei, Mingyuan Fan, Changqian Yu, Junshi Huang• 2024

Related benchmarks

Task                       Dataset                       Metric                  Result  Rank
Text-to-Music Generation   MusicCaps                     KLD                     1.25    11
Music Generation           Song Describer Dataset        FAD                     1.01    9
Text-to-Music Generation   Human Evaluation (Experts)    OVL (Overall Likeness)  3.35    4
Text-to-Music Generation   Human Evaluation (Beginners)  OVL                     3.25    4

Other info

Code: https://github.com/feizc/FluxMusic