Do Androids Know They're Only Dreaming of Electric Sheep?

About

We design probes trained on the internal representations of a transformer language model to predict its hallucinatory behavior on three grounded generation tasks. To train the probes, we annotate for span-level hallucination on both sampled (organic) and manually edited (synthetic) reference outputs. Our probes are narrowly trained and we find that they are sensitive to their training domain: they generalize poorly from one task to another or from synthetic to organic hallucinations. However, on in-domain data, they can reliably detect hallucinations at many transformer layers, achieving 95% of their peak performance as early as layer 4. Here, probing proves accurate for evaluating hallucination, outperforming several contemporary baselines and even surpassing an expert human annotator in response-level detection F1. Similarly, on span-level labeling, probes are on par with or better than the expert annotator on two out of three generation tasks. Overall, we find that probing is a feasible and efficient alternative for evaluating language model hallucination when model states are available.

Sky CH-Wang, Benjamin Van Durme, Jason Eisner, Chris Kedzie • 2023
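To make the setup described above concrete, here is a minimal sketch of the probing idea, not the authors' implementation: freeze a transformer, extract per-token hidden states from an intermediate layer, and fit a linear classifier on binary labels derived from annotated hallucination spans. GPT-2 as the backbone, layer 4, the offset-based labeling helpers, and the toy examples are all assumptions made for this sketch.

```python
# Sketch: linear probe on hidden states for token-level hallucination flags.
# GPT-2, LAYER, token_features, token_labels, and the toy data are
# illustrative assumptions, not the paper's actual model or dataset.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

LAYER = 4  # the paper reports ~95% of peak probe performance by layer 4

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

def token_features(text: str) -> torch.Tensor:
    """Hidden states at LAYER, one row per token: (seq_len, hidden_dim)."""
    with torch.no_grad():
        out = model(**tokenizer(text, return_tensors="pt"))
    return out.hidden_states[LAYER].squeeze(0)

def token_labels(text: str, spans: list[tuple[int, int]]) -> list[int]:
    """1 for tokens overlapping a hallucinated character span, else 0."""
    offsets = tokenizer(text, return_offsets_mapping=True)["offset_mapping"]
    return [int(any(s < e2 and e > s2 for s2, e2 in spans)) for s, e in offsets]

# Toy stand-ins for the paper's annotated outputs, which are sampled
# (organic) and manually edited (synthetic) generations with span labels.
examples = [
    ("The Eiffel Tower is in Paris.", []),           # fully grounded
    ("The Eiffel Tower is in Berlin.", [(23, 29)]),  # "Berlin" hallucinated
]

X = torch.cat([token_features(text) for text, _ in examples]).numpy()
y = [lab for text, spans in examples for lab in token_labels(text, spans)]

probe = LogisticRegression(max_iter=1000).fit(X, y)
pred = probe.predict(token_features("The Eiffel Tower is in Berlin.").numpy())
print(pred)  # per-token 0/1 hallucination flags
```

Because only the small linear head is trained and the backbone is read out at a single early layer, a probe like this is cheap to fit and run, which is the efficiency argument the abstract makes for probing when model states are available.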

Related benchmarks

Task | Dataset | Metric | Result | Rank
Claim Verification | 9-dataset aggregate, retrieval-free setting (test) | ROC-AUC | 75 | 70
Misclassification Detection | CoLA | ROC-AUC | 61.5 | 31
Hallucination Detection | CDM (test) | F1 | 75 | 16
Uncertainty Estimation | Aggregate (CoLA, GEmot, IMDB, News, SST5, ToxiGen, Yelp) | ECE | 10.3 | 13
Hallucination Detection | CF (test) | F1 | 94 | 10
Hallucination Detection | E2E (test) | F1-R | 90 | 10
Span-level Classification | CDM (test) | Span F1 | 55 | 6
Span-level Classification | E2E (test) | Span F1 | 56 | 6
Span-level Classification | Conv-FEVER (CF) (test) | Span F1 | 81 | 6
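The span-level rows above score predicted hallucination spans against gold annotations. The leaderboard does not state its matching rule, so the sketch below assumes the simplest convention, exact span match; the function name and inputs are hypothetical.

```python
# Span-level F1 under an exact-match assumption (the matching rule is
# not specified on this page): spans are (start, end) character offsets.
def span_f1(pred: set[tuple[int, int]], gold: set[tuple[int, int]]) -> float:
    if not pred and not gold:
        return 1.0  # nothing to find, nothing predicted
    tp = len(pred & gold)  # exactly matching spans
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# One of two gold spans recovered: P = 1.0, R = 0.5, F1 ~= 0.67.
print(span_f1({(23, 29)}, {(23, 29), (0, 3)}))
```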

Other info

Code
