
ALARM: Audio-Language Alignment for Reasoning Models

About

Large audio language models (ALMs) extend LLMs with auditory understanding. A common approach freezes the LLM and trains only an adapter on self-generated targets. However, this fails for reasoning LLMs (RLMs), whose built-in chain-of-thought traces expose the textual surrogate input, yielding unnatural responses. We propose self-rephrasing, which converts self-generated responses into audio-understanding variants compatible with RLMs while preserving distributional alignment. We further fuse and compress multiple audio encoders for stronger representations. For training, we construct a 6M-instance multi-task corpus (2.5M unique prompts) spanning 19K hours of speech, music, and sound. Our 4B-parameter ALM outperforms similarly sized models and surpasses most larger ALMs on related audio-reasoning benchmarks, while preserving textual capabilities at low training cost. Notably, we achieve the best open-source result on the MMAU-speech and MMSU benchmarks and rank third among all models.
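The fuse-and-compress step can be pictured as concatenating the per-frame features of several frozen audio encoders, projecting the result to a smaller dimension, and pooling over time to reduce the token count. The paper's actual encoders, dimensions, and compression scheme are not given here, so the following numpy sketch is purely illustrative (all names and sizes are hypothetical, and the projection is random rather than trained):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outputs of two frozen audio encoders for the same clip:
# shape (time_steps, feature_dim) each; dims are illustrative only.
enc_a = rng.standard_normal((100, 768))   # e.g. a speech encoder
enc_b = rng.standard_normal((100, 512))   # e.g. a general-sound encoder

def fuse_and_compress(feats, out_dim=256, stride=4, seed=0):
    """Concatenate encoder features along the channel axis, then compress
    with a (random, untrained) linear projection and temporal mean-pooling."""
    rng = np.random.default_rng(seed)
    fused = np.concatenate(feats, axis=-1)                 # (T, 768 + 512)
    proj = rng.standard_normal((fused.shape[-1], out_dim)) / np.sqrt(fused.shape[-1])
    projected = fused @ proj                               # (T, out_dim)
    T = projected.shape[0] - projected.shape[0] % stride   # drop ragged tail
    pooled = projected[:T].reshape(-1, stride, out_dim).mean(axis=1)
    return pooled                                          # (T // stride, out_dim)

tokens = fuse_and_compress([enc_a, enc_b])
print(tokens.shape)  # (25, 256): 4x fewer audio tokens for the LLM to consume
```

In a real ALM the projection would be a trained adapter and the pooled tokens would be fed into the (frozen) LLM's embedding space; this sketch only shows the shape bookkeeping.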

Petr Grinberg, Hassan Shahmohammadi • 2026

Related benchmarks

Task                            Dataset                      Result                  Rank
Audio Understanding             MMAU v05.15.25 (test)        Sound Score 61.1        53
Audio Understanding             MMSU                         Perception Score 45.4   32
Multimodal Audio Understanding  MMAU mini v05.15.25 (test)   Sound Accuracy 66.4     25
Multimodal Audio Reasoning      MMAR                         Mean Score 48.7         22
