AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs
About
Large Audio Language Models (LALMs) are rapidly advancing, but evaluating them remains challenging because existing toolkits are inefficient and non-standardized, which limits fair comparison and systematic assessment. Existing evaluation frameworks exhibit three critical limitations: (1) slow and inefficient processing pipelines that bottleneck large-scale studies; (2) inadequate multi-turn dialogue support, leaving fundamental questions about cross-turn context integration and performance dynamics over extended conversations in LALMs unanswered; and (3) the absence of a unified and scalable evaluation framework capable of keeping pace with the rapid growth of both LALMs and audio benchmarks. To address these issues, we introduce AU-Harness, an efficient and comprehensive evaluation framework for LALMs. Our system achieves a speedup of up to 151% over existing evaluation toolkits through optimized batch processing and parallel execution, enabling large-scale evaluations previously considered impractical. We provide standardized prompting protocols and flexible configurations for fair model comparison across diverse scenarios. AU-Harness unlocks a range of in-depth analyses that are difficult to conduct without a unified foundation, including multi-turn dialogue dynamics, enabling the study of true audio reasoning capabilities in existing LALMs. AU-Harness provides both practical evaluation tools and insights into model limitations, advancing systematic LALM development.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multi-turn dialogue | MT-Bench | -- | 126 |
| Automatic Speech Recognition | LibriSpeech | -- | 35 |
| Instruction Following | IFEval | -- | 21 |
| Audio Understanding | AudioCaps | -- | 11 |
| Hearing Disorder | StutterDetect | -- | 11 |
| Multi-turn dialogue | SpokenWoz | -- | 11 |
| Music Understanding | ChoMusic | -- | 11 |
| Phoneme Recognition | voxangeles | -- | 11 |
| Safety | AdvBench | -- | 11 |
| Speaker & Language | SR | -- | 11 |