Attention-guided Evidence Grounding for Spoken Question Answering
About
Spoken Question Answering (Spoken QA) presents a challenging cross-modal problem: effectively aligning acoustic queries with textual knowledge while avoiding the latency and error propagation inherent in cascaded ASR-based systems. In this paper, we introduce Attention-guided Evidence Grounding (AEG), a novel end-to-end framework that leverages the internal cross-modal attention of Speech Large Language Models (SpeechLLMs) to explicitly locate and ground key evidence in the model's latent space. To address the diffuse attention distribution in pre-trained models, we propose Learning to Focus on Evidence (LFE), a supervised fine-tuning paradigm that calibrates the model's attention mechanism to distinguish query-relevant segments from irrelevant context. Experiments on SQuAD, HotpotQA, and MuSiQue demonstrate that AEG reduces hallucinations and achieves strong efficiency gains, outperforming large-scale cascaded baselines (Whisper-Large-v3 + Reranker) while reducing inference latency by approximately 62%.
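The paper does not give implementation details here, but the core idea — reading a SpeechLLM's cross-modal attention to pick out evidence segments, and a fine-tuning loss that sharpens that attention toward annotated evidence — can be sketched as follows. This is a minimal illustration under stated assumptions: the function names (`ground_evidence`, `lfe_loss`), the attention tensor shape, and the cross-entropy form of the LFE objective are all hypothetical, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def ground_evidence(attn: torch.Tensor, top_k: int = 3) -> torch.Tensor:
    """Select the top-k context segments by mean cross-attention mass.

    attn: [num_heads, num_query_tokens, num_segments] cross-attention
    weights from a SpeechLLM layer (hypothetical shape).
    Returns indices of the segments treated as grounded evidence.
    """
    # Average attention over heads and query tokens -> one score per segment.
    scores = attn.mean(dim=(0, 1))                 # [num_segments]
    return torch.topk(scores, k=top_k).indices

def lfe_loss(attn: torch.Tensor, evidence_mask: torch.Tensor) -> torch.Tensor:
    """One plausible "Learning to Focus on Evidence" objective:
    cross-entropy pushing attention mass toward annotated evidence.

    evidence_mask: [num_segments] binary, 1 = gold evidence segment.
    """
    scores = attn.mean(dim=(0, 1))                 # [num_segments]
    target = evidence_mask / evidence_mask.sum()   # normalize to a distribution
    log_probs = F.log_softmax(scores, dim=-1)
    return -(target * log_probs).sum()             # cross-entropy H(target, pred)
```

In this sketch the same pooled attention scores serve both roles: at inference they rank segments for grounding, and during fine-tuning the loss penalizes mass placed on non-evidence segments, which is one way to counteract the diffuse attention of pre-trained models described above.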
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Spoken Question Answering | HotpotQA v1.0 (test) | Exact Match (EM) | 79.16 | 12 |
| Spoken Question Answering | MuSiQue v1.0 (test) | Exact Match (EM) | 53.99 | 12 |
| Spoken Question Answering | SQuAD v1.1 (test) | Exact Match (EM) | 89.24 | 12 |
| Evidence Grounding | SQuAD v1.1 (test) | F1 Score | 80.02 | 10 |