Attention-guided Evidence Grounding for Spoken Question Answering
About
Spoken Question Answering (Spoken QA) presents a challenging cross-modal problem: effectively aligning acoustic queries with textual knowledge while avoiding the latency and error propagation inherent in cascaded ASR-based systems. In this paper, we introduce Attention-guided Evidence Grounding (AEG), a novel end-to-end framework that leverages the internal cross-modal attention of Speech Large Language Models (SpeechLLMs) to explicitly locate and ground key evidence in the model's latent space. To address the diffuse attention distribution in pre-trained models, we propose Learning to Focus on Evidence (LFE), a supervised fine-tuning paradigm that calibrates the model's attention mechanism to distinguish query-relevant segments from irrelevant context. Experiments on SQuAD, HotpotQA, and MuSiQue demonstrate that AEG reduces hallucinations and achieves strong efficiency gains, outperforming large-scale cascaded baselines (Whisper-Large-v3 + Reranker) while reducing inference latency by approximately 62%.
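The paper does not give implementation details here, but the core idea — reading a SpeechLLM's cross-modal attention to pick out evidence segments, and a fine-tuning loss that sharpens that attention toward annotated evidence — can be sketched as follows. This is a minimal illustration under stated assumptions: the function names (`ground_evidence`, `lfe_loss`), the attention tensor shape, and the cross-entropy form of the LFE objective are all hypothetical, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def ground_evidence(attn: torch.Tensor, top_k: int = 3) -> torch.Tensor:
    """Select the top-k context segments by mean cross-attention mass.

    attn: [num_heads, num_query_tokens, num_segments] cross-attention
    weights from a SpeechLLM layer (hypothetical shape).
    Returns indices of the segments treated as grounded evidence.
    """
    # Average attention over heads and query tokens -> one score per segment.
    scores = attn.mean(dim=(0, 1))                 # [num_segments]
    return torch.topk(scores, k=top_k).indices

def lfe_loss(attn: torch.Tensor, evidence_mask: torch.Tensor) -> torch.Tensor:
    """One plausible "Learning to Focus on Evidence" objective:
    cross-entropy pushing attention mass toward annotated evidence.

    evidence_mask: [num_segments] binary, 1 = gold evidence segment.
    """
    scores = attn.mean(dim=(0, 1))                 # [num_segments]
    target = evidence_mask / evidence_mask.sum()   # normalize to a distribution
    log_probs = F.log_softmax(scores, dim=-1)
    return -(target * log_probs).sum()             # cross-entropy H(target, pred)
```

In this sketch the same pooled attention scores serve both roles: at inference they rank segments for grounding, and during fine-tuning the loss penalizes mass placed on non-evidence segments, which is one way to counteract the diffuse attention of pre-trained models described above.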
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Spoken Question Answering | HotpotQA v1.0 (test) | Exact Match (EM) | 79.16 | 12 |
| Spoken Question Answering | MuSiQue v1.0 (test) | Exact Match (EM) | 53.99 | 12 |
| Spoken Question Answering | SQuAD v1.1 (test) | Exact Match (EM) | 89.24 | 12 |
| Evidence Grounding | SQuAD v1.1 (test) | F1 Score | 80.02 | 10 |