
AudioToolAgent: An Agentic Framework for Audio-Language Models

About

Large Audio-Language Models (LALMs) perform well on audio understanding tasks but lack the multi-step reasoning and tool calling found in recent Large Language Models (LLMs). This paper presents AudioToolAgent, a framework that coordinates audio-language models as tools through a central LLM agent, which accesses tool adapters for audio question answering and speech-to-text. The agent reasons about which tools to invoke, how to formulate follow-up queries, and how to arbitrate between conflicting tool outputs, all without accessing the audio itself. Experiments on MMAU, MMAR, and MMAU-Pro show state-of-the-art accuracy: up to 77.50% on MMAU, 77.00% on MMAR, and 61.90% on MMAU-Pro. A Shapley-based analysis identifies effective agent-tool combinations. The code and reproduction materials are available at https://github.com/GLJS/AudioToolAgent.
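To illustrate what a Shapley-based analysis of tool combinations can look like, here is a minimal Python sketch that computes exact Shapley values for a small set of tools. It is not taken from the AudioToolAgent repository. The function name `shapley_values`, the tool names, and the accuracy figures in `toy_accuracy` are all assumptions made up for illustration, not results from the paper.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley value of each player for a coalition value function.

    `value` maps a frozenset of players to a number, for example the
    benchmark accuracy reached when only those tools are available.
    """
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal gain from adding p to coalition s.
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result


# Hypothetical accuracies, invented for this example only.
toy_accuracy = {
    frozenset(): 0.25,
    frozenset({"audio_qa"}): 0.60,
    frozenset({"speech_to_text"}): 0.45,
    frozenset({"audio_qa", "speech_to_text"}): 0.70,
}

phi = shapley_values(["audio_qa", "speech_to_text"], toy_accuracy.__getitem__)
for tool, contribution in phi.items():
    print(f"{tool}: {contribution:.3f}")
```

The per-tool values always sum to the gap between the full tool set and the empty set, so each value can be read as that tool's share of the overall accuracy gain. Exact enumeration grows exponentially with the number of tools, so larger tool sets would likely need a sampling approximation.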

Gijs Wijngaard, Elia Formisano, Michel Dumontier, Jenia Jitsev • 2025

Related benchmarks

Task | Dataset | Result | Rank
Audio Reasoning | MMAU mini 1.0 (test) | Sound Score: 81.68 | 15
Audio Reasoning | MMAU-Pro | Sound: 54.96 | 11
Audio Reasoning | MMAR | Sound Accuracy: 73.33 | 8
