Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning

About

Text-to-music generation (T2M-Gen) faces a major obstacle due to the scarcity of large-scale publicly available music datasets with natural language captions. To address this, we propose the Music Understanding LLaMA (MU-LLaMA), capable of answering music-related questions and generating captions for music files. Our model utilizes audio representations from a pretrained MERT model to extract music features. However, obtaining a suitable dataset for training the MU-LLaMA model remains challenging, as existing publicly accessible audio question answering datasets lack the necessary depth for open-ended music question answering. To fill this gap, we present a methodology for generating question-answer pairs from existing audio captioning datasets and introduce the MusicQA Dataset designed for answering open-ended music-related questions. The experiments demonstrate that the proposed MU-LLaMA model, trained on our designed MusicQA dataset, achieves outstanding performance in both music question answering and music caption generation across various metrics, outperforming current state-of-the-art (SOTA) models in both fields and offering a promising advancement in the T2M-Gen research field.

Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, Ying Shan• 2023

Related benchmarks

Task	Dataset	Result
Music Genre Classification	GTZAN	Accuracy37.3	62
Audio Understanding	MMAU v05.15.25 (test)	Sound Score31	53
Multimodal Audio Understanding	MMAU mini v05.15.25 (test)	Sound Accuracy33	25
Multimodal Audio Reasoning	MMAR	Mean Score13.9	22
Music Understanding	MusicBench Global	PG Score27.3	13
Music Understanding	MuChoMusic	CC Score25	13
Music Understanding	MusicBench Temporal	Chords Score15.5	11
Music Reasoning	MuChoMusic	Overall Accuracy32.7	8
Instrument Recognition	Medley-Solos-DB	Accuracy38.6	8
Music Captioning	MusicCaps (test)	METEOR12.3	8

Showing 10 of 13 rows

Other info

Follow for update

@wizwand_team Discord