EvA: An Evidence-First Audio Understanding Paradigm for LALMs

About

Large Audio Language Models (LALMs) still struggle in complex acoustic scenes because they often fail to preserve task-relevant acoustic evidence before reasoning begins. We identify this error pattern as the evidence bottleneck: state-of-the-art systems show larger deficits in acoustic evidence extraction than in downstream reasoning, suggesting that upstream perception is often the limiting factor. To address this problem, we propose EvA (Evidence-First Audio), a dual-path architecture that enhances acoustic evidence preservation through hierarchical aggregation and non-compressive, time-aligned fusion. We also build EvA-Perception, a large-scale training set with about 54K event-ordered captions and 500K evidence-grounded QA pairs. Under a unified zero-shot protocol, EvA achieves the best open-source \emph{Perception} results on MMAU, MMAR, and MMSU, with the largest gains on perception-heavy splits. Human evaluation on open-ended captioning further shows improved fine-grained acoustic coverage and caption quality. These results support the evidence-first hypothesis: stronger audio understanding depends on preserving acoustic evidence before reasoning. Project can be found at https://satsuki2486441738.github.io/EvA/.

Xinyuan Xie, Shunian Chen, Zhiheng Liu, Yuhao Zhang, Zhiqiang Lv, Liyin Liang, Benyou Wang• 2026

Related benchmarks

Task	Dataset	Result
Audio Understanding	MMSU	Perception Score47.52	37
Acoustic Scene Classification	CochlScene	ACC87.04	17
Audio Understanding	MMAR	--	15
Audio Understanding	MMAU	Perception Score78.64	7

Showing 4 of 4 rows

Other info

Follow for update

@wizwand_team Discord