Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning
About
Text-to-music generation (T2M-Gen) faces a major obstacle due to the scarcity of large-scale publicly available music datasets with natural language captions. To address this, we propose the Music Understanding LLaMA (MU-LLaMA), capable of answering music-related questions and generating captions for music files. Our model utilizes audio representations from a pretrained MERT model to extract music features. However, obtaining a suitable dataset for training the MU-LLaMA model remains challenging, as existing publicly accessible audio question answering datasets lack the necessary depth for open-ended music question answering. To fill this gap, we present a methodology for generating question-answer pairs from existing audio captioning datasets and introduce the MusicQA Dataset designed for answering open-ended music-related questions. The experiments demonstrate that the proposed MU-LLaMA model, trained on our designed MusicQA dataset, achieves outstanding performance in both music question answering and music caption generation across various metrics, outperforming current state-of-the-art (SOTA) models in both fields and offering a promising advancement in the T2M-Gen research field.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Music Genre Classification | GTZAN | Accuracy | 37.3 | 62 |
| Audio Understanding | MMAU v05.15.25 (test) | Sound Score | 31 | 53 |
| Multimodal Audio Understanding | MMAU mini v05.15.25 (test) | Sound Accuracy | 33 | 25 |
| Multimodal Audio Reasoning | MMAR | Mean Score | 13.9 | 22 |
| Music Understanding | MusicBench Global | PG Score | 27.3 | 13 |
| Music Understanding | MuChoMusic | CC Score | 25 | 13 |
| Music Understanding | MusicBench Temporal | Chords Score | 15.5 | 11 |
| Music Reasoning | MuChoMusic | Overall Accuracy | 32.7 | 8 |
| Instrument Recognition | Medley-Solos-DB | Accuracy | 38.6 | 8 |
| Music Captioning | MusicCaps (test) | METEOR | 12.3 | 8 |