Do Androids Know They're Only Dreaming of Electric Sheep?
About
We design probes trained on the internal representations of a transformer language model to predict its hallucinatory behavior on three grounded generation tasks. To train the probes, we annotate for span-level hallucination on both sampled (organic) and manually edited (synthetic) reference outputs. Our probes are narrowly trained and we find that they are sensitive to their training domain: they generalize poorly from one task to another or from synthetic to organic hallucinations. However, on in-domain data, they can reliably detect hallucinations at many transformer layers, achieving 95% of their peak performance as early as layer 4. Here, probing proves accurate for evaluating hallucination, outperforming several contemporary baselines and even surpassing an expert human annotator in response-level detection F1. Similarly, on span-level labeling, probes are on par with or better than the expert annotator on two out of three generation tasks. Overall, we find that probing is a feasible and efficient alternative to language model hallucination evaluation when model states are available.
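To make the idea of a probe concrete, here is a minimal sketch of one common choice, a linear (logistic regression) probe that maps a per-token hidden-state vector to a hallucinated / not-hallucinated label. The abstract does not specify the probe architecture, so this is an illustration rather than the authors' method. The "hidden states" are synthetic Gaussian vectors, and all names (`train_linear_probe`, `make_toy_tokens`, `response_is_hallucinated`) as well as the rule that a response counts as hallucinated if any token is flagged are assumptions made for this example.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_linear_probe(states, labels, lr=0.5, epochs=200):
    """Fit a logistic regression probe with full-batch gradient descent."""
    dim = len(states[0])
    w, b = [0.0] * dim, 0.0
    n = len(states)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(states, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_predict(w, b, state, threshold=0.5):
    # 1 = token flagged as hallucinated, 0 = token treated as supported.
    return int(sigmoid(sum(wi * xi for wi, xi in zip(w, state)) + b) >= threshold)

def response_is_hallucinated(token_preds):
    # Assumed aggregation rule: a response is flagged if any token is flagged.
    return int(any(token_preds))

def f1_score(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def make_toy_tokens(n, dim, rng):
    """Synthetic stand-in for hidden states; hallucinated tokens are shifted along axis 0."""
    states, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += 3.0 if y else -3.0
        states.append(x)
        labels.append(y)
    return states, labels

rng = random.Random(0)
train_x, train_y = make_toy_tokens(400, 8, rng)
test_x, test_y = make_toy_tokens(200, 8, rng)
w, b = train_linear_probe(train_x, train_y)
test_pred = [probe_predict(w, b, x) for x in test_x]
token_f1 = f1_score(test_y, test_pred)
print(f"token-level F1 on toy data: {token_f1:.3f}")
```

In a real setting, the vectors would be read from a chosen transformer layer for each generated token, and the labels would come from span-level annotations like those described above. Training one such probe per layer is one way to compare how detectable hallucination is at different depths.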
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Uncertainty Quantification | Aggregated Experimental Datasets (XSum, SamSum, CNN, WMT19, MedQUAD, TruthfulQA, CoQA, SciQ, TriviaQA, MMLU, GSM8k) (test) | Mean Rank | 3.09 | 88 |
| Claim Verification | 9-dataset aggregate retrieval-free setting (test) | ROC-AUC | 75 | 70 |
| Selective Generation | GSM8K | ROC-AUC | 88.5 | 66 |
| Mathematical Reasoning | GSM8K | PRR | 0.71 | 66 |
| Selective Generation | CoQA | ROC-AUC | 74.6 | 66 |
| Selective Generation | MMLU | ROC-AUC | 0.945 | 66 |
| Machine Translation | WMT 19 | PRR | 64 | 66 |
| Selective Generation | CNN | ROC-AUC | 72.1 | 66 |
| Selective Generation | WMT19 | ROC-AUC | 0.831 | 66 |
| Selective Generation | TruthfulQA | ROC-AUC | 0.736 | 66 |