GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models

About

In this paper, we propose GaMMA, a state-of-the-art (SoTA) large multimodal model (LMM) designed to achieve comprehensive musical content understanding. GaMMA inherits the streamlined encoder-decoder design of LLaVA, enabling effective cross-modal learning between music and language. By incorporating audio encoders in a mixture-of-experts manner, GaMMA unifies time-series and non-time-series music understanding tasks within one set of parameters. Our approach combines carefully curated datasets at scale with a progressive training pipeline, pushing the boundaries of music understanding via pretraining, supervised fine-tuning (SFT), and reinforcement learning (RL). To comprehensively assess both the temporal and non-temporal capabilities of music LMMs, we introduce MusicBench, the largest music-oriented benchmark, comprising 3,739 human-curated multiple-choice questions covering diverse aspects of musical understanding. Extensive experiments demonstrate that GaMMA establishes a new SoTA in the music domain, achieving 79.1% accuracy on MuChoMusic, 79.3% on MusicBench-Temporal, and 81.3% on MusicBench-Global, consistently outperforming previous methods.

Zuyao You, Zhesong Yu, Mingyu Liu, Bilei Zhu, Yuan Wan, Zuxuan Wu • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Understanding | MMAU v05.15.25 (test-mini) | Sound Score | 79.9 | 54 |
| Music Understanding | MusicBench Global | PG Score | 96.8 | 13 |
| Music Understanding | MuChoMusic | CC Score | 78.6 | 13 |
| Music Understanding | MusicBench Temporal | Chords Score | 75 | 11 |
| Global Music Analysis | Expert Study Global Dimensions | Vocal | 95.8 | 3 |
| Temporal Music Analysis | Expert Study Temporal Dimensions | FVCP Score | 88.4 | 3 |
