
Harnessing Reasoning Trajectories for Hallucination Detection via Answer-agreement Representation Shaping

About

Large reasoning models (LRMs) often generate long, seemingly coherent reasoning traces yet still produce incorrect answers, making hallucination detection challenging. Although trajectories contain useful signals, directly using trace text or vanilla hidden states for detection is brittle: traces vary in form, and detectors can overfit to superficial patterns rather than answer validity. We introduce Answer-agreement Representation Shaping (ARS), which learns detection-friendly, trace-conditioned representations by explicitly encoding answer stability. ARS generates counterfactual answers through small latent interventions (specifically, perturbing the trace-boundary embedding) and labels each perturbation by whether the resulting answer agrees with the original. It then learns representations that bring answer-agreeing states together and separate answer-disagreeing ones, exposing latent instability indicative of hallucination risk. The shaped embeddings are plug-and-play with existing embedding-based detectors and require no human annotations during training. Experiments demonstrate that ARS consistently improves detection and achieves substantial gains over strong baselines.
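The pipeline described above can be sketched in a few lines. This is a minimal toy illustration, not the authors' implementation: `toy_answer` stands in for regenerating an answer from a latent state (in ARS this would be the LRM decoding from the perturbed trace-boundary embedding), and `shaping_loss` is an assumed hinge-style objective that pulls answer-agreeing states together and pushes disagreeing ones apart; all names and parameters here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_answer(h):
    # Hypothetical stand-in for the model's answer head:
    # the "answer" is the sign of a fixed linear projection of the state.
    w = np.ones_like(h)
    return int(np.sign(h @ w))

def agreement_labels(h_boundary, n_perturb=8, sigma=0.1):
    """Perturb the trace-boundary embedding with small Gaussian noise and
    label each perturbation by whether its regenerated answer agrees with
    the unperturbed one (1 = agree, 0 = disagree)."""
    base = toy_answer(h_boundary)
    states, labels = [], []
    for _ in range(n_perturb):
        h_p = h_boundary + sigma * rng.standard_normal(h_boundary.shape)
        states.append(h_p)
        labels.append(int(toy_answer(h_p) == base))
    return np.stack(states), np.array(labels)

def shaping_loss(Z, labels, margin=1.0):
    # Assumed contrastive-style shaping objective: compact the
    # answer-agreeing cluster, push disagreeing states past a margin.
    pos, neg = Z[labels == 1], Z[labels == 0]
    loss = 0.0
    if len(pos) > 1:
        loss += float(np.mean(np.linalg.norm(pos - pos.mean(0), axis=1) ** 2))
    if len(pos) and len(neg):
        d = np.linalg.norm(neg - pos.mean(0), axis=1)
        loss += float(np.mean(np.maximum(0.0, margin - d) ** 2))
    return loss

# A state far from the answer-decision boundary is stable under perturbation;
# a state near it flips answers, exposing latent instability.
_, stable = agreement_labels(np.full(16, 1.0))
_, unstable = agreement_labels(np.zeros(16))
instability = 1.0 - stable.mean()   # low -> answer robust to interventions
```

In this toy setup, the fraction of disagreeing perturbations acts as an unsupervised instability score: a confident state yields all-agree labels, while a near-boundary state yields frequent flips, which is the signal the shaped embeddings are trained to expose.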

Jianxiong Zhang, Bing Guo, Yuming Jiang, Haobo Wang, Bo An, Xuefeng Du • 2026

Related benchmarks

Task                    | Dataset                        | Metric | Result | Rank
Hallucination Detection | TriviaQA                       | AUROC  | 0.9162 | 265
Hallucination Detection | GSM8K                          | AUROC  | 0.9037 | 53
Hallucination Detection | TruthfulQA                     | AUROC  | 0.9417 | 47
Hallucination Detection | MATH-500                       | AUROC  | 0.88   | 31
Hallucination Detection | TruthfulQA 25% random (test)   | AUROC  | 0.8772 | 11
Hallucination Detection | MATH-500 25% random (test)     | AUROC  | 0.7943 | 11
