AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs

About

Large Audio Language Models (LALMs) are rapidly advancing, but evaluating them remains challenging due to inefficient and non-standardized toolkits that limit fair comparison and systematic assessment. Existing evaluation frameworks exhibit three critical limitations: (1) slow and inefficient processing pipelines that bottleneck large-scale studies; (2) inadequate multi-turn dialogue support, which leaves fundamental questions about cross-turn context integration and performance dynamics over extended conversations in LALMs unanswered; and (3) the absence of a unified and scalable evaluation framework capable of keeping pace with the rapid growth of both LALMs and audio benchmarks. To address these issues, we introduce AU-Harness, an efficient and comprehensive evaluation framework for LALMs. Our system achieves a speedup of up to 151% over existing evaluation toolkits through optimized batch processing and parallel execution, enabling large-scale evaluations previously considered impractical. We provide standardized prompting protocols and flexible configurations for fair model comparison across diverse scenarios. AU-Harness unlocks a range of in-depth analyses that are difficult to conduct without a unified foundation, including multi-turn dialogue dynamics, enabling the study of true audio reasoning capabilities in existing LALMs. AU-Harness provides both practical evaluation tools and insights into model limitations, advancing systematic LALM development.

Hoang Nguyen, Sidharth Surapaneni, Akshay Kalkunte, Jash Mehta, Aman Tiwari, Oluwanifemi Bamgbose, Khyati Mahajan, Jash Shah, Shruthan Radhakrishna, Sathwik Tejaswi Madhusudhan, Vikas Yadav, Sai Rajeswar • 2025

Related benchmarks

Task	Dataset	Result
Multi-turn dialogue	MT-Bench	--	126
Automatic Speech Recognition	LibriSpeech	--	35
Instruction Following	IFEval	--	21
Audio Understanding	AudioCaps	--	11
Hearing Disorder	StutterDetect	--	11
Multi-turn dialogue	SpokenWoz	--	11
Music Understanding	ChoMusic	--	11
Phoneme Recognition	voxangeles	--	11
Safety	AdvBench	--	11
Speaker & Language	SR	--	11

Showing 10 of 14 rows
