Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry
About
Large language models hallucinate during multi-step reasoning, but most existing detectors operate at the trace level: they assign one confidence score to a full output, fail to localize the first error, and often require multiple sampled completions. We frame hallucination instead as a property of the hidden-state trajectory produced during a single forward pass. Correct reasoning moves through a stable manifold of locally coherent transitions; a first error appears as a localized excursion in transport cost away from this manifold. We operationalize this view with two models: a label-conditioned teacher that builds a trace-specific contrastive PCA lens and scores each step with seven geometric transition features, and a deployable BiLSTM student, distilled from the teacher, that operates on raw hidden states without inference-time labels. We prove that contrastive PCA is the optimal projection for a transport-separation objective between first-error and correct states, and that single-pass first-error localization holds whenever the first error creates a positive transport margin over preceding correct transitions. On ProcessBench, PRM800K, HaluEval, and TruthfulQA, both models outperform entropy-based, probing-based, and attention-based baselines in-domain; the teacher transfers stably across language models and datasets, while the student collapses under shift, a gap our distillation theory predicts. These results recast step-level hallucination detection as a problem of trajectory dynamics and identify the central obstacle to deployment: preserving the contrastive transport margin under distribution shift.
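As a rough illustration of the contrastive PCA lens and step-level scoring described above, the following is a minimal sketch on synthetic data, not the paper's implementation. It assumes the generic contrastive PCA formulation (top eigenvectors of the target covariance minus `alpha` times the background covariance) and uses the norm of the projected step-to-step displacement as a simple stand-in for a transport cost. The function names, `alpha`, `k`, and the synthetic hidden states are all hypothetical; the seven geometric transition features are not reproduced here.

```python
import numpy as np

def contrastive_pca(target, background, alpha=1.0, k=1):
    """Return the top-k directions that maximize target variance
    minus alpha * background variance (generic contrastive PCA)."""
    t = target - target.mean(axis=0)
    b = background - background.mean(axis=0)
    cov_t = t.T @ t / (len(t) - 1)
    cov_b = b.T @ b / (len(b) - 1)
    # The contrast matrix is symmetric, so eigh applies (ascending eigenvalues).
    evals, evecs = np.linalg.eigh(cov_t - alpha * cov_b)
    order = np.argsort(evals)[::-1][:k]
    return evecs[:, order]  # shape (d, k)

def step_costs(hidden, lens):
    """Norm of each projected transition h_t -> h_{t+1} (assumed cost proxy)."""
    z = hidden @ lens
    return np.linalg.norm(np.diff(z, axis=0), axis=1)

# Synthetic stand-ins for hidden states (assumption: d = 8).
rng = np.random.default_rng(0)
d, n = 8, 2000
# "Correct" states vary mostly along axis 0.
correct = rng.normal(size=(n, d)) * np.array([3.0] + [0.1] * (d - 1))
# "Error" states share that variation but also vary along axis 1.
error = rng.normal(size=(n, d)) * np.array([3.0, 2.0] + [0.1] * (d - 2))

lens = contrastive_pca(error, correct, alpha=1.0, k=1)

# A 6-step trajectory: steady motion along axis 0, then a jump along
# axis 1 between steps 3 and 4 (transition index 3).
trajectory = np.zeros((6, d))
trajectory[:, 0] = np.arange(6)
trajectory[4:, 1] = 5.0

costs = step_costs(trajectory, lens)
first_error_transition = int(np.argmax(costs))
print(costs.round(2), first_error_transition)
```

In this toy setting the lens suppresses the direction shared by correct and error states, so ordinary transitions receive a small cost and the injected jump stands out with a clear margin over the preceding transitions. Real hidden states would need per-trace labels to build the lens, which is why the abstract distinguishes the label-conditioned teacher from the label-free student.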
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Hallucination detection | HaluEval | AUROC | 0.94 | 131 |
| First-error detection | ProcessBench | Accuracy | 68.7 | 6 |
| First-error detection | PRM800K | Accuracy | 92.9 | 6 |
| First-error detection | TruthfulQA | Accuracy | 96.8 | 6 |
| Step-level hallucination detection | ProcessBench | AUROC | 91 | 6 |
| Step-level hallucination detection | PRM800K | AUROC | 99.8 | 6 |
| Step-level hallucination detection | TruthfulQA | AUROC | 0.965 | 6 |