MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models
About
Research on large language models has advanced significantly across text, speech, images, and videos. However, multi-modal music understanding and generation remain underexplored due to the lack of well-annotated datasets. To address this, we introduce a dataset with 167.69 hours of multi-modal data, including text, images, videos, and music annotations. Based on this dataset, we propose MuMu-LLaMA, a model that leverages pre-trained encoders for music, images, and videos. For music generation, we integrate AudioLDM 2 and MusicGen. Our evaluation across four tasks (music understanding, text-to-music generation, prompt-based music editing, and multi-modal music generation) demonstrates that MuMu-LLaMA outperforms state-of-the-art models, showing its potential for multi-modal music applications.
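To make the pipeline concrete, below is a minimal sketch of the architecture the abstract describes: frozen modality encoders feed learned adapters, whose outputs are fused with text tokens and passed through a language-model backbone that produces conditioning for a music decoder such as MusicGen or AudioLDM 2. All class names, dimensions, the adapter design, and the tiny Transformer stand-in for the LLaMA backbone are illustrative assumptions, not the released MuMu-LLaMA code.

```python
# Illustrative sketch only: names, dimensions, and the tiny Transformer
# stand-in for the LLaMA backbone are assumptions, not the released code.
import torch
import torch.nn as nn


class ModalityAdapter(nn.Module):
    """Projects frozen-encoder features into the LLM embedding space."""

    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, tokens, enc_dim) -> (batch, tokens, llm_dim)
        return self.proj(feats)


class MultiModalMusicLM(nn.Module):
    """Fuses music/image/video features with text tokens in one sequence."""

    def __init__(self, llm_dim: int = 512,
                 music_dim: int = 1024, image_dim: int = 768, video_dim: int = 768):
        super().__init__()
        self.music_adapter = ModalityAdapter(music_dim, llm_dim)
        self.image_adapter = ModalityAdapter(image_dim, llm_dim)
        self.video_adapter = ModalityAdapter(video_dim, llm_dim)
        # Stand-in for the LLaMA backbone (kept tiny so the sketch runs).
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # Head producing a conditioning vector for a music decoder such as
        # MusicGen or AudioLDM 2 (the decoders themselves are not modeled here).
        self.cond_head = nn.Linear(llm_dim, llm_dim)

    def forward(self, text_emb, music_feats=None, image_feats=None, video_feats=None):
        parts = []
        for feats, adapter in ((music_feats, self.music_adapter),
                               (image_feats, self.image_adapter),
                               (video_feats, self.video_adapter)):
            if feats is not None:
                parts.append(adapter(feats))  # modality tokens first
        parts.append(text_emb)                # then the text prompt tokens
        hidden = self.backbone(torch.cat(parts, dim=1))
        # Pool the joint sequence into one conditioning vector for the decoder.
        return self.cond_head(hidden.mean(dim=1))


if __name__ == "__main__":
    model = MultiModalMusicLM()
    cond = model(
        text_emb=torch.randn(2, 16, 512),     # fake text embeddings
        video_feats=torch.randn(2, 32, 768),  # fake video-encoder features
    )
    print(cond.shape)  # torch.Size([2, 512])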
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Multimodal Understanding | MMMU | MMMU Score | 12.87 | 59 |
| Video-to-Music Generation | V2M-bench (test) | Fréchet Audio Distance (FAD) | 52.25 | 12 |
| Video-to-Music Generation | OES-Pub | FAD* | 6.67 | 7 |
| Video-to-Music Generation | MovieGenBench Music | FAD | 5.84 | 7 |
| Multimodal Understanding | MMSU | MMSU Score | 6.32 | 7 |
| Text-to-Music | MusicCaps (test) | FAD | 5.89 | 6 |
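Most rows above report Fréchet Audio Distance (FAD), which fits a Gaussian to embeddings of the reference set and of the generated set and measures the distance between the two distributions (lower is better). Below is a minimal sketch of the standard formula, assuming embeddings have already been extracted with an audio embedding model (commonly VGGish); the function name and demo data are placeholders.

```python
# Minimal FAD sketch; assumes embeddings were already extracted elsewhere.
import numpy as np
from scipy import linalg


def frechet_audio_distance(real_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """FAD between two sets of audio embeddings (one row per clip).

    Fits a Gaussian to each set and returns
    ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2}).
    """
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    s_r = np.cov(real_emb, rowvar=False)
    s_g = np.cov(gen_emb, rowvar=False)
    covmean, _ = linalg.sqrtm(s_r @ s_g, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerics
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(s_r + s_g - 2.0 * covmean))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(200, 128))           # embeddings of reference clips
    fake = rng.normal(loc=0.5, size=(200, 128))  # embeddings of generated clips
    print(frechet_audio_distance(real, fake))    # larger shift -> larger FAD
```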