
SAM: A Mamba-2 State-Space Audio-Language Model

About

We present SAM, a State-space Audio-language Model that integrates an audio encoder with a Mamba-2 backbone. SAM-2.7B achieves 21.1 mAP on AudioSet and 17.6 SPICE on AudioCaps, matching or surpassing larger 7B transformer-based models with fewer parameters. We further provide the first systematic, representation-level analysis of how SSMs interact with audio encoder outputs: (1) joint audio encoder finetuning is essential, supported by accuracy gains and observed adaptation of token representation rank and similarity across different SSM sizes; (2) despite linear scaling, SSMs benefit more from compact, information-rich audio token representations than from excessively long token sequences; and (3) incorporating instruction-following supervision substantially improves reasoning ability, boosting MMAU-Sound accuracy from 22.8 to 56.8. Through comprehensive experiments and analysis, we establish practical design principles for SSMs as strong, scalable backbones for audio-language models.
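The pipeline described above (a pretrained audio encoder producing token representations that a state-space backbone consumes in linear time) can be sketched minimally as follows. This is a hypothetical illustration, not the paper's implementation: `audio_encoder` is a toy stand-in for a real pretrained encoder, and `ssm_block` is a plain diagonal state-space scan rather than Mamba-2.

```python
import numpy as np

rng = np.random.default_rng(0)

def audio_encoder(waveform, n_tokens=32, d_model=16):
    """Toy encoder: split the waveform into frames and project each
    frame to a d_model-dim token (stand-in for a pretrained encoder)."""
    frames = np.array_split(waveform, n_tokens)
    feats = np.stack(
        [np.array([f.mean(), f.std(), f.max(), f.min()]) for f in frames]
    )                                            # (n_tokens, 4) frame stats
    proj = rng.standard_normal((feats.shape[1], d_model))
    return feats @ proj                          # (n_tokens, d_model) tokens

def ssm_block(x, decay=0.9):
    """Minimal diagonal SSM recurrence: h_t = a*h_{t-1} + x_t, y_t = h_t.
    Cost is linear in sequence length, unlike quadratic self-attention."""
    h = np.zeros(x.shape[1])
    ys = []
    for x_t in x:
        h = decay * h + x_t
        ys.append(h.copy())
    return np.stack(ys)

waveform = rng.standard_normal(16000)   # 1 s of 16 kHz audio
tokens = audio_encoder(waveform)        # compact audio token sequence
out = ssm_block(tokens)                 # contextualized tokens, same shape
print(out.shape)                        # (32, 16)
```

The fixed-size recurrent state `h` is why the paper's second finding is plausible: the backbone compresses history as it scans, so a shorter sequence of information-rich tokens can be preferable to a very long one even though each step is cheap.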

Taehan Lee, Jaehan Jung, Hyukjun Lee • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Classification | ESC-50 | Accuracy | 90 | 366 |
| Environmental Sound Classification | FSD50K | mAP | 49.2 | 91 |
| Audio Classification | VGG-Sound | Top-1 Accuracy | 57 | 83 |
| Audio Captioning | Clotho | -- | -- | 60 |
| Audio Classification | AudioSet | mAP | 21.1 | 46 |
| Acoustic Scene Classification | TUT Acoustic Scenes | Accuracy | 33.3 | 35 |
| Audio Classification | Beijing Opera | Base Accuracy | 69.1 | 34 |
| Acoustic Scene Classification | DCASE | Mi-F1 Score | 48.9 | 21 |
| Audio Classification | VocalSound | Accuracy | 72.2 | 21 |
| Audio Reasoning | MMAU v05.15.25 (mini) | Sound Score | 61.86 | 10 |

Showing 10 of 11 rows
