ALARM: Audio-Language Alignment for Reasoning Models
About
Large audio language models (ALMs) extend LLMs with auditory understanding. A common approach freezes the LLM and trains only an adapter on self-generated targets. However, this fails for reasoning LLMs (RLMs), whose built-in chain-of-thought traces expose the textual surrogate input and yield unnatural responses. We propose self-rephrasing, which converts self-generated responses into audio-understanding variants compatible with RLMs while preserving distributional alignment. We further fuse and compress multiple audio encoders for stronger representations. For training, we construct a 6M-instance multi-task corpus (2.5M unique prompts) spanning 19K hours of speech, music, and sound. Our 4B-parameter ALM outperforms similarly sized models and surpasses most larger ALMs on related audio-reasoning benchmarks, while preserving textual capabilities at low training cost. Notably, we achieve the best open-source result on the MMAU-speech and MMSU benchmarks and rank third among all models.
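The self-rephrasing idea above can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the function, templates, and string substitutions are invented for clarity. The key point is that both the prompt and the self-generated response are rewritten so that no mention of the textual surrogate (e.g. a transcript) survives into the audio-training pair, keeping the RLM's chain-of-thought consistent with an audio input.

```python
# Hypothetical sketch of the self-rephrasing step (names and templates
# are illustrative; a real system would use an LLM-based rewriter).

def self_rephrase(text_prompt: str, self_generated_response: str) -> dict:
    """Convert a text-surrogate training pair into an audio-understanding
    variant: the prompt now refers to the audio input, and references to
    the textual surrogate in the response are rewritten accordingly."""
    audio_prompt = text_prompt.replace(
        "the following transcript", "the audio clip"
    )
    # Rewrite surface references so the chain-of-thought no longer
    # exposes the text surrogate.
    audio_response = self_generated_response.replace(
        "the transcript says", "the speaker says"
    )
    return {"prompt": audio_prompt, "response": audio_response}

pair = self_rephrase(
    "Summarize the following transcript.",
    "Based on what the transcript says, the topic is climate policy.",
)
print(pair["prompt"])    # -> Summarize the audio clip.
print(pair["response"])  # -> Based on what the speaker says, ...
```

Because the rewritten responses are only lightly edited versions of the model's own outputs, their distribution stays close to what the frozen RLM would generate, which is what the abstract means by "preserving distributional alignment."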
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Understanding | MMAU v05.15.25 (test) | Sound Score | 61.1 | 53 |
| Audio Understanding | MMSU | Perception Score | 45.4 | 32 |
| Multimodal Audio Understanding | MMAU mini v05.15.25 (test) | Sound Accuracy | 66.4 | 25 |
| Multimodal Audio Reasoning | MMAR | Mean Score | 48.7 | 22 |