INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection

About

Knowledge hallucinations have raised widespread concerns about the security and reliability of deployed LLMs. Previous efforts to detect hallucinations have relied on logit-level uncertainty estimation or language-level self-consistency evaluation, where semantic information is inevitably lost during the token-decoding procedure. Thus, we propose to explore the dense semantic information retained within LLMs' INternal States for hallucInation DEtection (INSIDE). In particular, a simple yet effective EigenScore metric is proposed to better evaluate the self-consistency of responses; it exploits the eigenvalues of the responses' covariance matrix to measure semantic consistency/diversity in the dense embedding space. Furthermore, from the perspective of self-consistent hallucination detection, a test-time feature clipping approach is explored to truncate extreme activations in the internal states, which reduces overconfident generations and potentially benefits the detection of overconfident hallucinations. Extensive experiments and ablation studies are performed on several popular LLMs and question-answering (QA) benchmarks, showing the effectiveness of our proposal.
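The two ideas in the abstract can be illustrated with a minimal sketch. It is not taken from the listing above: it assumes the score is the log-determinant of a regularized K x K covariance matrix built from K sampled response embeddings, divided by K, with each embedding centered over its own feature dimension. The function names, the `alpha` regularizer, and the clipping thresholds `low` and `high` are hypothetical choices made for illustration. The log-determinant equals the sum of the log-eigenvalues, so a Cholesky factorization is used here in place of an explicit eigen-solver.

```python
import math


def gram_covariance(embeddings):
    """K x K covariance between K response embeddings (assumed centering:
    each embedding has its own feature mean subtracted)."""
    centered = []
    for vec in embeddings:
        mean = sum(vec) / len(vec)
        centered.append([x - mean for x in vec])
    k = len(centered)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) for j in range(k)]
        for i in range(k)
    ]


def log_det_spd(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky.
    This equals the sum of the logarithms of its eigenvalues."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(lower[i][p] * lower[j][p] for p in range(j))
            lower[i][j] = math.sqrt(s) if i == j else s / lower[j][j]
    return 2.0 * sum(math.log(lower[i][i]) for i in range(n))


def eigenscore(embeddings, alpha=1e-3):
    """Higher score = more semantically diverse responses (assumed form).
    alpha is a hypothetical regularizer that keeps the matrix full rank."""
    cov = gram_covariance(embeddings)
    k = len(cov)
    for i in range(k):
        cov[i][i] += alpha
    return log_det_spd(cov) / k


def clip_features(vec, low, high):
    """Truncate extreme activations to [low, high] (hypothetical thresholds)."""
    return [min(max(x, low), high) for x in vec]


# Three identical embeddings (consistent answers) vs. three distinct ones.
same = [[1.0, 0.0, 0.0, 0.0]] * 3
diverse = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
print(eigenscore(same) < eigenscore(diverse))
```

Under these assumptions, identical embeddings produce a nearly rank-one covariance matrix whose small eigenvalues drive the score down, while divergent embeddings spread the eigenvalues and raise it. A high score would therefore flag a possibly hallucinated answer.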

Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, Jieping Ye • 2024

Related benchmarks

Task	Dataset	Result
Hallucination Detection	TriviaQA	AUROC 0.8174	625
Hallucination Detection	HotpotQA	AUROC 0.9025	294
Hallucination Detection	TriviaQA (test)	AUC-ROC 82.6	255
Hallucination Detection	NQ	AUC 0.757	199
Hallucination Detection	TruthfulQA	AUC (ROC) 0.59	182
Hallucination Detection	HaluEval (test)	AUC-ROC 76.9	176
Knowledge	MMLU	Accuracy 47.6	171
Hallucination Detection	NQ-Open	AUROC 0.6088	141
Hallucination Detection	HaluEval	AUROC 0.684	135
Hallucination Detection	CoQA	AUROC 70.92	134

