AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs

About

Large Audio Language Models (LALMs) are rapidly advancing, but evaluating them remains challenging because existing toolkits are inefficient and non-standardized, which limits fair comparison and systematic assessment. Existing evaluation frameworks exhibit three critical limitations: (1) slow and inefficient processing pipelines that bottleneck large-scale studies; (2) inadequate multi-turn dialogue support, which leaves fundamental questions about cross-turn context integration and performance dynamics over extended conversations in LALMs unanswered; and (3) the absence of a unified and scalable evaluation framework capable of keeping pace with the rapid growth of both LALMs and audio benchmarks. To address these issues, we introduce AU-Harness, an efficient and comprehensive evaluation framework for LALMs. Our system achieves a speedup of up to 151% over existing evaluation toolkits through optimized batch processing and parallel execution, enabling large-scale evaluations previously considered impractical. We provide standardized prompting protocols and flexible configurations for fair model comparison across diverse scenarios. AU-Harness unlocks a range of in-depth analyses that are difficult to conduct without a unified foundation, including multi-turn dialogue dynamics, enabling the study of true audio reasoning capabilities in existing LALMs. AU-Harness provides both practical evaluation tools and insights into model limitations, advancing systematic LALM development.

Hoang Nguyen, Sidharth Surapaneni, Akshay Kalkunte, Jash Mehta, Aman Tiwari, Oluwanifemi Bamgbose, Khyati Mahajan, Jash Shah, Shruthan Radhakrishna, Sathwik Tejaswi Madhusudhan, Vikas Yadav, Sai Rajeswar • 2025
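
The abstract attributes the speedup to batch processing and parallel execution. As a rough illustration of that general idea only, and not AU-Harness's actual implementation, the following minimal sketch groups samples into batches and dispatches them concurrently with Python's standard-library thread pool. Every name here (`make_batches`, `evaluate_parallel`, `dummy_infer_batch`) is hypothetical, and the "model" is a stand-in that simply upper-cases its input.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence


def make_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def evaluate_parallel(
    samples: Sequence[str],
    infer_batch: Callable[[List[str]], List[str]],
    batch_size: int = 4,
    max_workers: int = 8,
) -> List[str]:
    """Run `infer_batch` over all batches concurrently, preserving input order."""
    batches = list(make_batches(samples, batch_size))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in submission order, so outputs
        # line up with `samples` even if batches finish out of order.
        per_batch = list(pool.map(infer_batch, batches))
    return [output for batch in per_batch for output in batch]


def dummy_infer_batch(batch: List[str]) -> List[str]:
    """Hypothetical stand-in for a model endpoint call; upper-cases its input."""
    return [text.upper() for text in batch]


predictions = evaluate_parallel(
    ["a.wav", "b.wav", "c.wav"], dummy_infer_batch, batch_size=2
)
```

Threads suit this sketch because calls to a remote model endpoint are typically I/O-bound; a real harness would presumably also need retries, rate limiting, and per-model configuration, none of which is shown here.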

Related benchmarks

Task                          Dataset        Result  Rank
Multi-turn dialogue           MT-Bench       -       -     126
Automatic Speech Recognition  LibriSpeech    -       -     35
Instruction Following         IFEval         -       -     21
Audio Understanding           AudioCaps      -       -     11
Hearing Disorder              StutterDetect  -       -     11
Multi-turn dialogue           SpokenWoz      -       -     11
Music Understanding           ChoMusic       -       -     11
Phoneme Recognition           voxangeles     -       -     11
Safety                        AdvBench       -       -     11
Speaker & Language            SR             -       -     11
Showing 10 of 14 rows
