MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners

About

We propose MuseControlLite, a lightweight mechanism designed to fine-tune text-to-music generation models for precise conditioning using various time-varying musical attributes and reference audio signals. The key finding is that positional embeddings, which have been seldom used by text-to-music generation models in the conditioner for text conditions, are critical when the condition of interest is a function of time. Using melody control as an example, our experiments show that simply adding rotary positional embeddings to the decoupled cross-attention layers increases control accuracy from 56.6% to 61.1%, while requiring 6.75 times fewer trainable parameters than state-of-the-art fine-tuning mechanisms, using the same pre-trained diffusion Transformer model of Stable Audio Open. We evaluate various forms of musical attribute control, audio inpainting, and audio outpainting, demonstrating improved controllability over MusicGen-Large and Stable Audio Open ControlNet at a significantly lower fine-tuning cost, with only 85M trainable parameters. Source code, model checkpoints, and demo examples are available at: https://musecontrollite.github.io/web/.

Fang-Duo Tsai, Shih-Lun Wu, Weijaw Lee, Sheng-Ping Yang, Bo-Rui Chen, Hao-Chung Cheng, Yi-Hsuan Yang • 2025
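
The mechanism named in the abstract, rotary positional embeddings inside a decoupled cross-attention layer, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `params` dictionary, the single-head layout, and the choice to rotate only the query and key of the condition branch are assumptions made for illustration, and the sketch further assumes the time-varying condition has already been resampled to the same frame grid as the hidden states.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate feature pairs of x (seq, dim) by position-dependent angles.

    dim must be even; feature i is paired with feature i + dim // 2.
    """
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)        # (half,)
    angles = positions[:, None] * freqs[None, :]     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def attention(q, k, v):
    """Plain single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def decoupled_cross_attention(h, text, cond, params, scale=1.0):
    """Text branch plus a separate condition branch sharing one query.

    h:    (T, d) hidden states of the backbone
    text: (L, d) text-condition tokens
    cond: (T, d) time-varying condition features (hypothetical layout)
    """
    q = h @ params["w_q"]

    # Text branch: existing projections, no positional embedding.
    text_out = attention(q, text @ params["w_k_text"], text @ params["w_v_text"])

    # Condition branch: new key/value projections (the trainable part in
    # this sketch), with RoPE on query and key so that attention scores
    # depend on the relative time offset between frames.
    k_c = cond @ params["w_k_cond"]
    v_c = cond @ params["w_v_cond"]
    q_pos = np.arange(h.shape[0], dtype=float)
    c_pos = np.arange(cond.shape[0], dtype=float)
    cond_out = attention(rope(q, q_pos), rope(k_c, c_pos), v_c)

    return text_out + scale * cond_out
```

Setting `scale=0.0` recovers the text-only cross-attention output, which is one reason a decoupled design of this kind can be attached to a pre-trained model without disturbing its original behaviour.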

Related benchmarks

Task	Dataset	Result
Music Editing	Music Editing Subjective (evaluation)	Target Attribute Match (T): 3.03	6
Music Editing	ZoME-Bench Genre	CLAP: 29.3	6
Music Editing	ZoME-Bench Instrument	CLAP: 25	6
Cover Song Generation	SongEval (test)	CLAP: 0.26	5
Aesthetic Evaluation	SongEval	Coherence: 3.389	4
Aesthetic Evaluation	Suno70k	Coherence: 3.144	4
Cover Song Generation	Cover Song Generation (w/ Music Background)	MF: 2.63	3
Cover Song Generation	Cover Song Generation w/o Music Background	MF Score: 2.689	3
Music Generation	Suno70k academic short-form	FAD: 52.1	3

