
Attention Sinks as Internal Signals for Hallucination Detection in Large Language Models

About

Large language models frequently exhibit hallucinations: fluent, confident outputs that are factually incorrect or unsupported by the input context. While recent hallucination detection methods have explored various features derived from attention maps, the underlying mechanisms they exploit remain poorly understood. In this work, we propose SinkProbe, a hallucination detection method grounded in the observation that hallucinations are deeply entangled with attention sinks (tokens that accumulate disproportionate attention mass during generation), indicating a transition from distributed, input-grounded attention to compressed, prior-dominated computation. Notably, although sink scores are computed solely from attention maps, we find that the classifier preferentially relies on sinks whose associated value vectors have large norms. Moreover, we show that previous methods implicitly depend on attention sinks by establishing a mathematical relationship between their features and sink scores. These findings yield a theoretically grounded hallucination detection method that achieves state-of-the-art results across popular datasets and LLMs.
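The abstract describes sink scores as quantities computed solely from attention maps, measuring how much attention mass individual tokens accumulate. The paper's exact definition is not reproduced on this page; a minimal sketch under a common formulation (the average attention a key token receives, aggregated over heads and query positions, with `sink_scores` as a hypothetical helper name) might look like:

```python
import numpy as np

def sink_scores(attn: np.ndarray) -> np.ndarray:
    """Per-token 'sink score' sketch: the average attention mass each key
    token receives, aggregated over heads and query positions.

    attn: attention map of shape (num_heads, seq_len, seq_len),
          where each row (over the last axis) sums to 1.
    Returns: array of shape (seq_len,), one score per token.
    """
    # Averaging over heads (axis 0) and query positions (axis 1) leaves,
    # for each key position, the mean attention it attracts.
    return attn.mean(axis=(0, 1))

# Toy example: 2 heads, 4 tokens, with token 0 attracting most attention,
# mimicking the typical sink pattern at the start of a sequence.
rng = np.random.default_rng(0)
attn = rng.random((2, 4, 4))
attn[:, :, 0] += 5.0                      # inflate attention toward token 0
attn /= attn.sum(axis=-1, keepdims=True)  # renormalize rows to sum to 1

scores = sink_scores(attn)
print(scores.argmax())  # token 0 dominates, i.e. it acts as the sink
```

A detector in the spirit of SinkProbe would feed such per-token scores (per layer and head) to a classifier; the specific features and classifier used in the paper are not given here.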

Jakub Binkowski, Kamil Adamczewski, Tomasz Kajdanowicz• 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Hallucination Detection | TriviaQA | AUROC | 0.883 | 438
Hallucination Detection | TruthfulQA | AUROC | 0.821 | 102
Hallucination Detection | GSM8K | AUROC | 85.4 | 93
Hallucination Detection | NQ-Open | AUROC | 0.821 | 61
Hallucination Detection | HaluEvalQA | AUROC | 89 | 28
Hallucination Detection | SQuAD v2 | AUROC | 0.798 | 28
Hallucination Detection | UMWP | AUROC | 89.6 | 28
