MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models
About
Research on large language models has advanced significantly across text, speech, images, and videos. However, multi-modal music understanding and generation remain underexplored due to the lack of well-annotated datasets. To address this, we introduce a dataset with 167.69 hours of multi-modal data, including text, images, videos, and music annotations. Based on this dataset, we propose MuMu-LLaMA, a model that leverages pre-trained encoders for music, images, and videos. For music generation, we integrate AudioLDM 2 and MusicGen. Our evaluation across four tasks (music understanding, text-to-music generation, prompt-based music editing, and multi-modal music generation) demonstrates that MuMu-LLaMA outperforms state-of-the-art models, showing its potential for multi-modal music applications.
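To make the pipeline concrete, below is a minimal sketch of the architecture the abstract describes: frozen modality encoders feed learned adapters, whose outputs are fused with text tokens and passed through a language-model backbone that produces conditioning for a music decoder such as MusicGen or AudioLDM 2. All class names, dimensions, the adapter design, and the tiny Transformer stand-in for the LLaMA backbone are illustrative assumptions, not the released MuMu-LLaMA code.

```python
# Illustrative sketch only: names, dimensions, and the tiny Transformer
# stand-in for the LLaMA backbone are assumptions, not the released code.
import torch
import torch.nn as nn


class ModalityAdapter(nn.Module):
    """Projects frozen-encoder features into the LLM embedding space."""

    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, tokens, enc_dim) -> (batch, tokens, llm_dim)
        return self.proj(feats)


class MultiModalMusicLM(nn.Module):
    """Fuses music/image/video features with text tokens in one sequence."""

    def __init__(self, llm_dim: int = 512,
                 music_dim: int = 1024, image_dim: int = 768, video_dim: int = 768):
        super().__init__()
        self.music_adapter = ModalityAdapter(music_dim, llm_dim)
        self.image_adapter = ModalityAdapter(image_dim, llm_dim)
        self.video_adapter = ModalityAdapter(video_dim, llm_dim)
        # Stand-in for the LLaMA backbone (kept tiny so the sketch runs).
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # Head producing a conditioning vector for a music decoder such as
        # MusicGen or AudioLDM 2 (the decoders themselves are not modeled here).
        self.cond_head = nn.Linear(llm_dim, llm_dim)

    def forward(self, text_emb, music_feats=None, image_feats=None, video_feats=None):
        parts = []
        for feats, adapter in ((music_feats, self.music_adapter),
                               (image_feats, self.image_adapter),
                               (video_feats, self.video_adapter)):
            if feats is not None:
                parts.append(adapter(feats))  # modality tokens first
        parts.append(text_emb)                # then the text prompt tokens
        hidden = self.backbone(torch.cat(parts, dim=1))
        # Pool the joint sequence into one conditioning vector for the decoder.
        return self.cond_head(hidden.mean(dim=1))


if __name__ == "__main__":
    model = MultiModalMusicLM()
    cond = model(
        text_emb=torch.randn(2, 16, 512),     # fake text embeddings
        video_feats=torch.randn(2, 32, 768),  # fake video-encoder features
    )
    print(cond.shape)  # torch.Size([2, 512])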
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Multimodal Understanding | MMMU | MMMU Score | 12.87 | 59 |
| Video-to-Music Generation | V2M-bench (test) | Fréchet Audio Distance (FAD) | 52.25 | 12 |
| Video-to-Music Generation | OES-Pub | FAD* | 6.67 | 7 |
| Video-to-Music Generation | MovieGenBench Music | FAD | 5.84 | 7 |
| Multimodal Understanding | MMSU | MMSU Score | 6.32 | 7 |
| Text-to-Music | MusicCaps (test) | FAD | 5.89 | 6 |
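Most rows above report Fréchet Audio Distance (FAD), which fits a Gaussian to embeddings of the reference set and of the generated set and measures the distance between the two distributions (lower is better). Below is a minimal sketch of the standard formula, assuming embeddings have already been extracted with an audio embedding model (commonly VGGish); the function name and demo data are placeholders.

```python
# Minimal FAD sketch; assumes embeddings were already extracted elsewhere.
import numpy as np
from scipy import linalg


def frechet_audio_distance(real_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """FAD between two sets of audio embeddings (one row per clip).

    Fits a Gaussian to each set and returns
    ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2}).
    """
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    s_r = np.cov(real_emb, rowvar=False)
    s_g = np.cov(gen_emb, rowvar=False)
    covmean, _ = linalg.sqrtm(s_r @ s_g, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerics
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(s_r + s_g - 2.0 * covmean))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(200, 128))           # embeddings of reference clips
    fake = rng.normal(loc=0.5, size=(200, 128))  # embeddings of generated clips
    print(frechet_audio_distance(real, fake))    # larger shift -> larger FAD
```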