SAM: A Mamba-2 State-Space Audio-Language Model
About
We present SAM, a State-space Audio-language Model that integrates an audio encoder with a Mamba-2 backbone. SAM-2.7B achieves 21.1 mAP on AudioSet and 17.6 SPICE on AudioCaps, matching or surpassing 7B transformer-based models while using fewer parameters. We further provide the first systematic, representation-level analysis of how SSMs interact with audio encoder outputs: (1) joint finetuning of the audio encoder is essential, supported by accuracy gains and by observed adaptation of token representation rank and similarity across SSM sizes; (2) despite their linear scaling in sequence length, SSMs benefit more from compact, information-rich audio token representations than from excessively long token sequences; and (3) incorporating instruction-following supervision substantially improves reasoning ability, boosting MMAU-Sound accuracy from 22.8 to 56.8. Through comprehensive experiments and analysis, we establish practical design principles for SSMs as strong, scalable backbones for audio-language models.
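The pipeline described above can be sketched in a few lines: audio encoder tokens are projected into the language-model embedding space and prepended to the text sequence, which a state-space backbone then processes with a recurrence that is linear in sequence length. This is an illustrative toy sketch only; all names, dimensions, and the simplified diagonal-state recurrence are assumptions standing in for the actual SAM components, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def project_audio_tokens(audio_feats, W):
    """Map encoder outputs (T_a, d_audio) into LM space (T_a, d_model).
    A single linear projector is a common, but here assumed, choice."""
    return audio_feats @ W

def ssm_scan(x, A, B, C):
    """Minimal linear state-space recurrence, a toy stand-in for a
    Mamba-2 block: h_t = A * h_{t-1} + B x_t,  y_t = C h_t."""
    d_state = B.shape[0]
    h = np.zeros(d_state)
    ys = []
    for x_t in x:                 # one step per token: linear in length
        h = A * h + B @ x_t       # A is a decaying diagonal state matrix
        ys.append(C @ h)
    return np.stack(ys)

# Illustrative sizes (not from the paper).
T_a, d_audio, d_model, d_state = 8, 16, 12, 4
audio_feats = rng.standard_normal((T_a, d_audio))
text_embeds = rng.standard_normal((5, d_model))

W = rng.standard_normal((d_audio, d_model)) * 0.1
A = 0.9 * np.ones(d_state)
B = rng.standard_normal((d_state, d_model)) * 0.1
C = rng.standard_normal((d_model, d_state)) * 0.1

# Prepend projected audio tokens to the text sequence, then scan.
seq = np.concatenate([project_audio_tokens(audio_feats, W), text_embeds])
out = ssm_scan(seq, A, B, C)
print(out.shape)  # (13, 12): one output vector per audio or text token
```

Finding (2) above corresponds, in this sketch, to shrinking `T_a` (e.g. by pooling encoder frames) while keeping each token informative, rather than feeding very long raw token sequences into the scan.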
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Classification | ESC-50 | Accuracy | 90 | 366 |
| Environmental Sound Classification | FSD50K | mAP | 49.2 | 91 |
| Audio Classification | VGG-Sound | Top-1 Accuracy | 57 | 83 |
| Audio Captioning | Clotho | -- | -- | 60 |
| Audio Classification | AudioSet | mAP | 21.1 | 46 |
| Acoustic Scene Classification | TUT Acoustic Scenes | Accuracy | 33.3 | 35 |
| Audio Classification | Beijing Opera | Base Accuracy | 69.1 | 34 |
| Acoustic Scene Classification | DCASE | Mi-F1 Score | 48.9 | 21 |
| Audio Classification | VocalSound | Accuracy | 72.2 | 21 |
| Audio Reasoning | MMAU v05.15.25 (mini) | Sound Score | 61.86 | 10 |