HaluNet: Learning Hallucination Risk from Internal Signals in LLM Question Answering

About

Large language models (LLMs) achieve strong question answering (QA) performance but can produce fluent answers unsupported by available evidence. Existing hallucination detectors often rely on external verification, repeated sampling, or test-time judge calls, which can be costly for real-time QA. We propose HaluNet, a lightweight hallucination risk estimator that uses internal signals from one model generation. HaluNet jointly models token likelihood, predictive entropy, and hidden-state information, allowing probabilistic, distributional, and semantic evidence to inform an answer-level risk score. It is trained with LLM-as-a-Judge labels as scalable weak supervision and evaluated with independent human and multi-judge assessments. Experiments on SQuAD, TriviaQA, and Natural Questions show that HaluNet improves answer-level risk ranking across in-domain and out-of-domain settings. On a 300-example human evaluation, HaluNet achieves 0.874 AUROC and 0.869 AUPRC; 96.5% of its top 20% highest-risk answers are errors, a 2.06× lift over the base error rate.
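The ranking metrics quoted in the abstract can be illustrated with a short sketch. It assumes the usual definitions: AUROC as the probability that a randomly chosen erroneous answer receives a higher risk score than a randomly chosen correct one, and lift as the error rate among the top-scored fraction divided by the overall error rate. Under that definition of lift, the reported figures would imply a base error rate of roughly 96.5% / 2.06 ≈ 46.8%. The function names and the toy scores below are hypothetical and are not taken from the paper.

```python
def auroc(scores, labels):
    """Pairwise AUROC: fraction of (error, correct) pairs ranked correctly.

    labels use 1 for an erroneous answer and 0 for a correct one;
    tied scores count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one error and one correct answer")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def precision_and_lift_at_top(scores, labels, fraction=0.2):
    """Error rate among the highest-risk fraction, and its lift over the base rate."""
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no errors")
    k = max(1, round(fraction * len(scores)))
    # Sort answers from highest to lowest risk score and keep the top k.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_labels = [y for _, y in ranked[:k]]
    precision = sum(top_labels) / k
    return precision, precision / base_rate


# Toy data (hypothetical): ten answers with risk scores and error labels.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

print(auroc(toy_scores, toy_labels))
print(precision_and_lift_at_top(toy_scores, toy_labels, fraction=0.2))
```

The pairwise AUROC is quadratic in the number of answers, which is acceptable for an evaluation set of a few hundred examples; a rank-based formulation would be preferable for larger sets.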

Chaodong Tong, Qi Zhang, Zhuojun Jiang, Lei Jiang, Yanbing Liu • 2025

Related benchmarks

Task	Dataset	Metric	Result	Rank
Hallucination Detection	TriviaQA	AUROC	0.893	625
Hallucination Detection	SQuAD	ROC	83.9	18

