
EvA: An Evidence-First Audio Understanding Paradigm for LALMs

About

Large Audio Language Models (LALMs) still struggle in complex acoustic scenes because they often fail to preserve task-relevant acoustic evidence before reasoning begins. We identify this error pattern as the evidence bottleneck: state-of-the-art systems show larger deficits in acoustic evidence extraction than in downstream reasoning, suggesting that upstream perception is often the limiting factor. To address this problem, we propose EvA (Evidence-First Audio), a dual-path architecture that enhances acoustic evidence preservation through hierarchical aggregation and non-compressive, time-aligned fusion. We also build EvA-Perception, a large-scale training set with about 54K event-ordered captions and 500K evidence-grounded QA pairs. Under a unified zero-shot protocol, EvA achieves the best open-source Perception results on MMAU, MMAR, and MMSU, with the largest gains on perception-heavy splits. Human evaluation on open-ended captioning further shows improved fine-grained acoustic coverage and caption quality. These results support the evidence-first hypothesis: stronger audio understanding depends on preserving acoustic evidence before reasoning. The project page is at https://satsuki2486441738.github.io/EvA/.

Xinyuan Xie, Shunian Chen, Zhiheng Liu, Yuhao Zhang, Zhiqiang Lv, Liyin Liang, Benyou Wang • 2026

Related benchmarks

Task                           Dataset     Result                   Rank
Audio Understanding            MMSU        Perception Score 47.52   37
Acoustic Scene Classification  CochlScene  ACC 87.04                17
Audio Understanding            MMAR        --                       15
Audio Understanding            MMAU        Perception Score 78.64   7
