Audio-Thinker: Guiding Audio Language Model When and How to Think via Reinforcement Learning

About

Recent advancements in large language models, multimodal large language models, and large audio language models (LALMs) have significantly improved their reasoning capabilities through reinforcement learning with rule-based rewards. However, the explicit reasoning process has yet to show significant benefits for audio question answering, and effectively leveraging deep reasoning remains an open challenge, with LALMs still falling short of human-level auditory-language reasoning. To address these limitations, we propose Audio-Thinker, a reinforcement learning framework designed to enhance the reasoning capabilities of LALMs, with a focus on improving adaptability, consistency, and effectiveness. Our approach introduces an adaptive think accuracy reward, enabling the model to dynamically adjust its reasoning strategies based on task complexity. Furthermore, we incorporate an external reward model to evaluate the overall consistency and quality of the reasoning process, complemented by think-based rewards that help the model distinguish between valid and flawed reasoning paths during training. Experimental results demonstrate that our Audio-Thinker model outperforms existing reasoning-oriented LALMs across various benchmark tasks, exhibiting superior reasoning and generalization capabilities.

Shu Wu, Chenxing Li, Wenfu Wang, Hao Zhang, Hualei Wang, Meng Yu, Dong Yu • 2025

Related benchmarks

Task	Dataset	Result
Audio Reasoning	MMAR (test)	Average Score: 65.3	57
Audio Question Answering	MMAR	Average Score: 67.25	55
Audio Understanding	MMAU v05.15.25 (test-mini)	Sound Score: 81.98	54
Audio Understanding	MMAU (test)	--	31
Audio Understanding	MMAU mini original (test)	Accuracy (Sound Domain): 77.48	21
Audio Understanding	MMAU mini (test)	Accuracy: 78	20
Audio Understanding & Reasoning	MMAU	MMAU Score: 75.9	15
Multimodal Audio Understanding	MMAU Mini	Sound Score: 81.98	13
Speech Reasoning	MMAU Speech mini	Speech Score: 73.37	11
Speech Reasoning	MMAR-Speech	Speech Accuracy: 64.29	11

Showing 10 of 14 rows
