Entropy Distribution as a Fingerprint for Hallucinations in Generative Models

About

Large Language Models (LLMs) often generate factually incorrect outputs, commonly termed hallucinations, that undermine trust and limit deployment in high-stakes settings. Existing hallucination detection methods typically require multiple forward passes or access to model internals. In this work, we provide theoretical background and empirical evidence that the distribution of token-level entropies, beyond the mean captured by perplexity or length-normalised entropy, serves as a fingerprint of hallucination, with distributional shape and tail behaviour carrying independent signal. We formalize hallucination detection as a statistical hypothesis test and propose the Calibrated Entropy Score (CES), a lightweight algorithm requiring only a single forward pass and black-box access to token logits. CES combines the mean and the maximum of the token-level entropies of a generation through a calibrated reference CDF, producing scores that are directly comparable across models and tasks. We establish finite-sample calibration guarantees via a novel random-length Dvoretzky–Kiefer–Wolfowitz inequality, and also prove that CES detects hallucinations with probability converging to one exponentially fast in the generation length. Across eight QA benchmarks and ten generator models spanning open-source and API-access models, CES achieves the highest detection performance among all single-pass black-box methods while providing formal error guarantees that existing heuristics lack. Remarkably, CES is statistically indistinguishable from multi-sample methods that require far greater computational cost, closing the gap between lightweight and expensive detection and making it suitable for real-time, large-scale deployment.
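To make the ingredients named in the abstract concrete, the sketch below computes token-level entropies from logits, maps the mean and the maximum entropy of a generation through empirical reference CDFs, and evaluates the classical (fixed-sample-size) Dvoretzky–Kiefer–Wolfowitz band. It is an illustration under stated assumptions, not the authors' CES algorithm: the function names, the reference arrays `ref_means` and `ref_maxes` (imagined as statistics collected from generations known to be correct), and in particular the rule for combining the two CDF values (here simply the larger of the two) are all hypothetical. The paper's random-length DKW variant is not reproduced here.

```python
import numpy as np


def token_entropies(logits):
    """Shannon entropy (in nats) of the softmax distribution at each position.

    logits: array of shape (T, V), one row of vocabulary logits per generated token.
    """
    logits = np.asarray(logits, dtype=float)
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(np.exp(log_probs) * log_probs).sum(axis=-1)


def empirical_cdf(reference, x):
    """Fraction of reference values that are <= x."""
    reference = np.sort(np.asarray(reference, dtype=float))
    return np.searchsorted(reference, x, side="right") / reference.size


def entropy_score(logits, ref_means, ref_maxes):
    """Illustrative score in [0, 1]; higher means more unusual entropy.

    ASSUMPTION: taking the larger of the two CDF values is a placeholder
    combination rule chosen for this sketch, not the rule used by CES.
    """
    h = token_entropies(logits)
    u_mean = empirical_cdf(ref_means, h.mean())
    u_max = empirical_cdf(ref_maxes, h.max())
    return float(max(u_mean, u_max))


def dkw_epsilon(n, alpha):
    """Half-width of the classical DKW confidence band for an empirical CDF
    built from n i.i.d. samples, valid with probability at least 1 - alpha."""
    return float(np.sqrt(np.log(2.0 / alpha) / (2.0 * n)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(12, 50))     # toy generation: 12 tokens, vocab of 50
    ref_means = rng.uniform(2.0, 4.0, 200)  # hypothetical calibration statistics
    ref_maxes = rng.uniform(3.0, 4.0, 200)
    print(entropy_score(logits, ref_means, ref_maxes))
    print(dkw_epsilon(n=200, alpha=0.05))
```

Mapping each statistic through a reference CDF puts it on a common [0, 1] scale, which is what makes scores comparable across models and tasks; the DKW band bounds how far an empirical reference CDF built from a finite calibration set can deviate from the true one.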

Mattia J. Villani, Pranav Deshpande, Akshay Seshadri, Romina Yalovetzky, Niraj Kumar • 2026

Related benchmarks

Task	Dataset	Result
Hallucination Detection	TriviaQA	--	625
Hallucination Detection	NQ-Open	--	141
Hallucination Detection	CoQA	AUROC 0.694	134
Hallucination Detection	GSM8K	--	131
Hallucination Detection	SQuAD	--	127
Hallucination Detection	BioASQ	AUROC 0.698	104
Hallucination Detection	SVAMP	--	50
Hallucination Detection	80 Experiments Aggregated (test)	Average Rank 6.29	10
Hallucination Detection	DROP	--	2
