
Steer LLM Latents for Hallucination Detection

About

Hallucinations in LLMs pose a significant concern to their safe deployment in real-world applications. Recent approaches have leveraged the latent space of LLMs for hallucination detection, but their embeddings, optimized for linguistic coherence rather than factual accuracy, often fail to clearly separate truthful and hallucinated content. To address this, we propose the Truthfulness Separator Vector (TSV), a lightweight and flexible steering vector that reshapes the LLM's representation space during inference to enhance the separation between truthful and hallucinated outputs, without altering model parameters. Our two-stage framework first trains TSV on a small set of labeled exemplars to form compact and well-separated clusters. It then augments the exemplar set with unlabeled LLM generations, employing an optimal transport-based algorithm for pseudo-labeling combined with a confidence-based filtering process. Extensive experiments demonstrate that TSV achieves state-of-the-art performance with minimal labeled data, exhibiting strong generalization across datasets and providing a practical solution for real-world LLM applications.
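The core idea above can be sketched in a few lines: add a learned steering vector to a hidden state at inference time, then label the steered representation by its nearest class centroid. This is a minimal illustrative sketch, not the paper's implementation; the function names (`steer`, `classify_by_centroid`), the scaling factor `alpha`, and the toy centroids are all assumptions for demonstration.

```python
# Hypothetical sketch of inference-time steering for hallucination
# detection. All names and values here are illustrative assumptions,
# not the authors' actual code.

def steer(hidden, tsv, alpha=1.0):
    """Shift a hidden state along the Truthfulness Separator Vector."""
    return [h + alpha * v for h, v in zip(hidden, tsv)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify_by_centroid(hidden, tsv, truthful_c, hallucinated_c, alpha=1.0):
    """Label a steered representation by its nearest class centroid."""
    z = steer(hidden, tsv, alpha)
    if sq_dist(z, truthful_c) <= sq_dist(z, hallucinated_c):
        return "truthful"
    return "hallucinated"

# Toy 2-D example: the TSV pushes a borderline representation toward
# the truthful cluster's side of the separating axis.
tsv = [1.0, 0.0]
truthful_c = [2.0, 0.0]
hallucinated_c = [-2.0, 0.0]
print(classify_by_centroid([0.5, 0.3], tsv, truthful_c, hallucinated_c))
```

Because the base model's parameters are untouched, the only learned quantity in this sketch is the vector `tsv` itself, which is what keeps the approach lightweight.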

Seongheon Park, Xuefeng Du, Min-Hsuan Yeh, Haobo Wang, Yixuan Li • 2025

Related benchmarks

Task                     Dataset                Metric      Result   Rank
Hallucination Detection  TriviaQA               AUROC       0.5      265
Hallucination Detection  HotpotQA               AUROC       0.55     118
Hallucination Detection  RAGTruth (test)        AUROC       0.8123   83
Hallucination Detection  MATH                   Mean AUROC  72       72
Hallucination Detection  CSQA                   AUROC       71       55
Hallucination Detection  Dolly AC (test)        AUC         75.52    33
Hallucination Detection  RAGTruth LLaMA2-13B    Recall      80.68    19
Hallucination Detection  Dolly AC LLaMA2-7B     Recall      87.28    19
Hallucination Detection  Dolly AC LLaMA3-8B     Recall      64.67    19
Hallucination Detection  RAGTruth LLaMA2-7B     Recall      0.5526   19
Showing 10 of 12 rows
