
Steer LLM Latents for Hallucination Detection

About

Hallucinations in LLMs pose a significant concern to their safe deployment in real-world applications. Recent approaches have leveraged the latent space of LLMs for hallucination detection, but their embeddings, optimized for linguistic coherence rather than factual accuracy, often fail to clearly separate truthful and hallucinated content. To address this, we propose the Truthfulness Separator Vector (TSV), a lightweight and flexible steering vector that reshapes the LLM's representation space during inference to enhance the separation between truthful and hallucinated outputs, without altering model parameters. Our two-stage framework first trains TSV on a small set of labeled exemplars to form compact and well-separated clusters. It then augments the exemplar set with unlabeled LLM generations, employing an optimal transport-based algorithm for pseudo-labeling combined with a confidence-based filtering process. Extensive experiments demonstrate that TSV achieves state-of-the-art performance with minimal labeled data, exhibiting strong generalization across datasets and providing a practical solution for real-world LLM applications.
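The core idea above can be sketched in a few lines: add a learned steering vector to a hidden state at inference time, then label the steered representation by its nearest class centroid. This is a minimal illustrative sketch, not the paper's implementation; the function names (`steer`, `classify_by_centroid`), the scaling factor `alpha`, and the toy centroids are all assumptions for demonstration.

```python
# Hypothetical sketch of inference-time steering for hallucination
# detection. All names and values here are illustrative assumptions,
# not the authors' actual code.

def steer(hidden, tsv, alpha=1.0):
    """Shift a hidden state along the Truthfulness Separator Vector."""
    return [h + alpha * v for h, v in zip(hidden, tsv)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify_by_centroid(hidden, tsv, truthful_c, hallucinated_c, alpha=1.0):
    """Label a steered representation by its nearest class centroid."""
    z = steer(hidden, tsv, alpha)
    if sq_dist(z, truthful_c) <= sq_dist(z, hallucinated_c):
        return "truthful"
    return "hallucinated"

# Toy 2-D example: the TSV pushes a borderline representation toward
# the truthful cluster's side of the separating axis.
tsv = [1.0, 0.0]
truthful_c = [2.0, 0.0]
hallucinated_c = [-2.0, 0.0]
print(classify_by_centroid([0.5, 0.3], tsv, truthful_c, hallucinated_c))
```

Because the base model's parameters are untouched, the only learned quantity in this sketch is the vector `tsv` itself, which is what keeps the approach lightweight.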

Seongheon Park, Xuefeng Du, Min-Hsuan Yeh, Haobo Wang, Yixuan Li • 2025

Related benchmarks

Task                     Dataset                Metric      Result   Rank
Hallucination Detection  TriviaQA               AUROC       0.5      265
Hallucination Detection  HotpotQA               AUROC       0.55     118
Hallucination Detection  RAGTruth (test)        AUROC       0.8123   83
Hallucination Detection  MATH                   Mean AUROC  72       72
Hallucination Detection  CSQA                   AUROC       71       55
Hallucination Detection  Dolly AC (test)        AUC         75.52    33
Hallucination Detection  RAGTruth LLaMA2-13B    Recall      80.68    19
Hallucination Detection  Dolly AC LLaMA2-7B     Recall      87.28    19
Hallucination Detection  Dolly AC LLaMA3-8B     Recall      64.67    19
Hallucination Detection  RAGTruth LLaMA2-7B     Recall      0.5526   19
Showing 10 of 12 rows
