GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models

About

In this paper, we propose GaMMA, a state-of-the-art (SoTA) large multimodal model (LMM) designed for comprehensive understanding of musical content. GaMMA inherits the streamlined encoder-decoder design of LLaVA, enabling effective cross-modal learning between music and language. By incorporating audio encoders in a mixture-of-experts manner, GaMMA unifies time-series and non-time-series music understanding tasks within one set of parameters. Our approach combines carefully curated datasets at scale with a progressive training pipeline of pretraining, supervised fine-tuning (SFT), and reinforcement learning (RL). To assess both the temporal and non-temporal capabilities of music LMMs, we introduce MusicBench, the largest music-oriented benchmark, comprising 3,739 human-curated multiple-choice questions covering diverse aspects of musical understanding. Extensive experiments demonstrate that GaMMA establishes a new SoTA in the music domain, achieving 79.1% accuracy on MuChoMusic, 79.3% on MusicBench-Temporal, and 81.3% on MusicBench-Global, consistently outperforming previous methods.

Zuyao You, Zhesong Yu, Mingyu Liu, Bilei Zhu, Yuan Wan, Zuxuan Wu • 2026

Related benchmarks

Task	Dataset	Result
Audio Understanding	MMAU v05.15.25 (test-mini)	Sound Score 79.9	54
Music Understanding	MusicBench Global	PG Score 96.8	13
Music Understanding	MuChoMusic	CC Score 78.6	13
Music Understanding	MusicBench Temporal	Chords Score 75	11
Global Music Analysis	Expert Study Global Dimensions	Vocal 95.8	3
Temporal Music Analysis	Expert Study Temporal Dimensions	FVCP Score 88.4	3
