SAM: A Mamba-2 State-Space Audio-Language Model
About
We present SAM, a State-space Audio-language Model that integrates an audio encoder with a Mamba-2 backbone. SAM-2.7B achieves 21.1 mAP on AudioSet and 17.6 SPICE on AudioCaps, matching or surpassing 7B transformer-based models while using fewer parameters. We further provide the first systematic, representation-level analysis of how SSMs interact with audio encoder outputs: (1) joint finetuning of the audio encoder is essential, supported by accuracy gains and by observed adaptation of token representation rank and similarity across SSM sizes; (2) despite their linear scaling in sequence length, SSMs benefit more from compact, information-rich audio token representations than from excessively long token sequences; and (3) incorporating instruction-following supervision substantially improves reasoning ability, boosting MMAU-Sound accuracy from 22.8 to 56.8. Through comprehensive experiments and analysis, we establish practical design principles for SSMs as strong, scalable backbones for audio-language models.
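The pipeline described above can be sketched in a few lines: audio encoder tokens are projected into the language-model embedding space and prepended to the text sequence, which a state-space backbone then processes with a recurrence that is linear in sequence length. This is an illustrative toy sketch only; all names, dimensions, and the simplified diagonal-state recurrence are assumptions standing in for the actual SAM components, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def project_audio_tokens(audio_feats, W):
    """Map encoder outputs (T_a, d_audio) into LM space (T_a, d_model).
    A single linear projector is a common, but here assumed, choice."""
    return audio_feats @ W

def ssm_scan(x, A, B, C):
    """Minimal linear state-space recurrence, a toy stand-in for a
    Mamba-2 block: h_t = A * h_{t-1} + B x_t,  y_t = C h_t."""
    d_state = B.shape[0]
    h = np.zeros(d_state)
    ys = []
    for x_t in x:                 # one step per token: linear in length
        h = A * h + B @ x_t       # A is a decaying diagonal state matrix
        ys.append(C @ h)
    return np.stack(ys)

# Illustrative sizes (not from the paper).
T_a, d_audio, d_model, d_state = 8, 16, 12, 4
audio_feats = rng.standard_normal((T_a, d_audio))
text_embeds = rng.standard_normal((5, d_model))

W = rng.standard_normal((d_audio, d_model)) * 0.1
A = 0.9 * np.ones(d_state)
B = rng.standard_normal((d_state, d_model)) * 0.1
C = rng.standard_normal((d_model, d_state)) * 0.1

# Prepend projected audio tokens to the text sequence, then scan.
seq = np.concatenate([project_audio_tokens(audio_feats, W), text_embeds])
out = ssm_scan(seq, A, B, C)
print(out.shape)  # (13, 12): one output vector per audio or text token
```

Finding (2) above corresponds, in this sketch, to shrinking `T_a` (e.g. by pooling encoder frames) while keeping each token informative, rather than feeding very long raw token sequences into the scan.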
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Classification | ESC-50 | Accuracy | 90 | 366 |
| Environmental Sound Classification | FSD50K | mAP | 49.2 | 91 |
| Audio Classification | VGG-Sound | Top-1 Accuracy | 57 | 83 |
| Audio Captioning | Clotho | -- | -- | 60 |
| Audio Classification | AudioSet | mAP | 21.1 | 46 |
| Acoustic Scene Classification | TUT Acoustic Scenes | Accuracy | 33.3 | 35 |
| Audio Classification | Beijing Opera | Base Accuracy | 69.1 | 34 |
| Acoustic Scene Classification | DCASE | Mi-F1 Score | 48.9 | 21 |
| Audio Classification | VocalSound | Accuracy | 72.2 | 21 |
| Audio Reasoning | MMAU v05.15.25 (mini) | Sound Score | 61.86 | 10 |