
Attention-guided Evidence Grounding for Spoken Question Answering

About

Spoken Question Answering (Spoken QA) presents a challenging cross-modal problem: effectively aligning acoustic queries with textual knowledge while avoiding the latency and error propagation inherent in cascaded ASR-based systems. In this paper, we introduce Attention-guided Evidence Grounding (AEG), a novel end-to-end framework that leverages the internal cross-modal attention of Speech Large Language Models (SpeechLLMs) to explicitly locate and ground key evidence in the model's latent space. To address the diffuse attention distribution in pre-trained models, we propose Learning to Focus on Evidence (LFE), a supervised fine-tuning paradigm that calibrates the model's attention mechanism to distinguish query-relevant segments from irrelevant context. Experiments on SQuAD, HotpotQA, and MuSiQue demonstrate that AEG reduces hallucinations and achieves strong efficiency gains, outperforming large-scale cascaded baselines (Whisper-Large-v3 + Reranker) while reducing inference latency by approximately 62%.
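The core idea of attention-guided grounding can be illustrated with a minimal sketch: aggregate the cross-modal attention mass that audio query tokens place on each text token, then rank candidate evidence segments by their average attention. This is an assumption-laden toy illustration, not the paper's implementation; the function name `ground_evidence`, the averaging scheme, and the segment layout are all invented for exposition.

```python
import numpy as np

def ground_evidence(attn, segment_ids, top_k=2):
    """Toy sketch of attention-based evidence selection (illustrative only).

    attn        : (num_query_tokens, num_text_tokens) attention weights,
                  e.g. taken from a cross-attention layer of a SpeechLLM.
    segment_ids : (num_text_tokens,) segment index for each text token.
    Returns the indices of the top_k segments by mean attention mass.
    """
    # Average attention each text token receives across all query tokens.
    token_scores = attn.mean(axis=0)
    segs = np.unique(segment_ids)
    # Score each segment by the mean attention over its tokens.
    seg_scores = np.array(
        [token_scores[segment_ids == s].mean() for s in segs]
    )
    order = np.argsort(seg_scores)[::-1]  # highest-attention segments first
    return [int(segs[i]) for i in order[:top_k]]
```

In this framing, the LFE fine-tuning stage would supervise these attention distributions so that query-relevant segments receive sharply higher mass than distractor context, rather than the diffuse distribution typical of pre-trained models.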

Ke Yang, Bolin Chen, Yuejie Li, Yueying Hua, Jianhao Nie, Yueping He, Bowen Li, Chengjun Mao • 2026

Related benchmarks

Task                      | Dataset              | Metric           | Result | Rank
Spoken Question Answering | HotpotQA v1.0 (test) | Exact Match (EM) | 79.16  | 12
Spoken Question Answering | MuSiQue v1.0 (test)  | Exact Match (EM) | 53.99  | 12
Spoken Question Answering | SQuAD v1.1 (test)    | Exact Match (EM) | 89.24  | 12
Evidence Grounding        | SQuAD v1.1 (test)    | F1 Score         | 80.02  | 10
