LeVo: High-Quality Song Generation with Multi-Preference Alignment

About

Recent advances in large language models (LLMs) and audio language models have significantly improved music generation, particularly in lyrics-to-song generation. However, existing approaches still struggle with the complex composition of songs and the scarcity of high-quality data, leading to limitations in audio quality, musicality, instruction following, and vocal-instrument harmony. To address these challenges, we introduce LeVo, a language-model-based framework consisting of LeLM and Music Codec. LeLM is capable of parallel modeling of two types of tokens: mixed tokens, which represent the combined audio of vocals and accompaniment to achieve better vocal-instrument harmony, and dual-track tokens, which separately encode vocals and accompaniment for high-quality song generation. It employs two decoder-only transformers and a modular extension training strategy to prevent interference between different token types. To further enhance musicality and instruction-following ability, we introduce a multi-preference alignment method based on Direct Preference Optimization (DPO). This method handles diverse human preferences through a semi-automatic data construction process and post-training. Experimental results demonstrate that LeVo significantly outperforms existing open-source methods in both objective and subjective metrics, while performing competitively with industry systems. Ablation studies further justify the effectiveness of our designs. Audio examples and source code are available at https://levo-demo.github.io and https://github.com/tencent-ailab/songgeneration.
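
For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not taken from the LeVo paper or code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, and the abstract does not specify how LeVo combines several preference types.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO loss for one (chosen, rejected) pair.

    Inputs are summed sequence log-probabilities under the trained policy
    and a frozen reference model. All names here are hypothetical.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # -log(sigmoid(margin)), computed in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference agree, the loss equals log 2; it falls below that value as the policy assigns relatively more probability to the preferred sample than the reference does.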

Shun Lei, Yaoxun Xu, Zhiwei Lin, Huaicheng Zhang, Wei Tan, Hangting Chen, Jianwei Yu, Yixuan Zhang, Chenyu Yang, Haina Zhu, Shuai Wang, Zhiyong Wu, Dong Yu • 2025

Related benchmarks

Task	Dataset	Result
Audio Tagging	MTT	MTT AP 26	19
Song Generation	Song Generation Evaluation Set (test)	OVL 3.71	15
Music Reconstruction	Music Reconstruction (Evaluation Set)	VISQOL 3.26	13
Music Generation	Bilingual (Chinese/English) music 20 styles (test)	CE 7.61	11
Music Generation	HeartBeats Benchmark Chinese	CE 7.63	10
Music Generation	HeartBeats Benchmark English	CE Score 7.55	10
Song Generation	100 Chinese and 100 English songs (val)	FAD 3.73	8
Neural Audio Coding	Codec Benchmark	cnBPT 50	8
Song Generation	AudioBox aesthetic	CE 7.43	6
Song Generation	Mandarin pop songs (test)	PER 29.8	6

Showing 10 of 22 rows
