No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes
About
Do large language models (LLMs) anticipate when they will answer correctly? To study this, we extract activations after a question has been read but before any tokens are generated, and train linear probes to predict whether the model's forthcoming answer will be correct. Across three open-source model families ranging from 7 to 70 billion parameters, projections onto this "in-advance correctness direction", trained on generic trivia questions, predict success both in distribution and on diverse out-of-distribution knowledge datasets, outperforming black-box baselines and verbalised predicted confidence; this suggests the probes capture a deeper signal than dataset-specific spurious features. Predictive power saturates in intermediate layers and, notably, generalisation falters on questions requiring mathematical reasoning. Moreover, when models respond "I don't know", doing so correlates strongly with the probe score, indicating that the same direction also captures confidence. By complementing previous probe- and sparse-autoencoder-based results on truthfulness and other behaviours, our work contributes to elucidating LLM internals.
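The core recipe described above can be sketched in a few lines: fit a linear direction on pre-generation activations labelled by answer correctness, then score held-out questions by their projection onto that direction. This is a minimal toy illustration on synthetic data, not the paper's implementation; the hidden size, dataset size, the planted "correctness direction", and the difference-of-class-means probe are all assumptions made for the example.

```python
import numpy as np

def mass_mean_probe(X, y):
    """Difference-of-class-means direction (a simple linear probe)."""
    return X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)

def auroc(scores, y):
    """AUROC via the rank-sum (Mann-Whitney U) statistic; ignores ties."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos, n_neg = (y == 1).sum(), (y == 0).sum()
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
d, n = 64, 600  # hidden size and dataset size are illustrative, not the paper's
true_dir = rng.normal(size=d) / np.sqrt(d)   # planted "correctness direction"
y = rng.integers(0, 2, size=n)               # 1 = model will answer correctly
# Stand-in for activations at the last question token, before generation.
X = rng.normal(size=(n, d)) + 2.0 * np.outer(y - 0.5, true_dir)

train, test = slice(0, 500), slice(500, n)
w = mass_mean_probe(X[train], y[train])
scores = X[test] @ w   # projection onto the learned direction
print(f"held-out AUROC: {auroc(scores, y[test]):.2f}")
```

With a planted linear signal the probe recovers a direction close to `true_dir`, so the held-out AUROC is well above chance; on real activations the direction is fit per layer, which is where the intermediate-layer saturation reported above is observed.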
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Correctness Prediction | TriviaQA | AUROC | 0.826 | 45 |
| Correctness Prediction | Notable People | AUROC | 82.5 | 18 |
| Correctness Prediction | Cities | AUROC | 88 | 18 |
| Correctness Prediction | Medals | AUROC | 77 | 18 |
| Correctness Prediction | GSM8K | AUROC | 60.1 | 18 |
| Correctness Prediction | Math operations | AUROC | 0.858 | 18 |
| Factual Question Answering | Factual Category Average (test) | Accuracy | 31.38 | 18 |
| Mathematical Reasoning | Math Category Average (test) | Accuracy | 50.23 | 18 |
| Code Generation | Code Category Average (test) | Accuracy | 76.58 | 18 |
| Confidence Prediction | MATH500 (val) | Spearman Correlation | 0.46 | 4 |
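The confidence-prediction rows above are scored with Spearman rank correlation between probe scores and outcomes. As a reference for how that metric behaves, here is a minimal self-contained version (Pearson correlation of ranks, ignoring ties); the probe scores and accuracies below are synthetic stand-ins, not data from the paper.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Ties are not averaged, which is fine for continuous scores.
    """
    def ranks(x):
        order = np.argsort(x)
        r = np.empty(len(x))
        r[order] = np.arange(len(x))
        return r
    return np.corrcoef(ranks(a), ranks(b))[0, 1]

# Hypothetical probe scores with a noisy monotone link to answer accuracy.
rng = np.random.default_rng(1)
probe_scores = rng.normal(size=200)
accuracy = probe_scores + rng.normal(scale=2.0, size=200)
print(f"Spearman: {spearman(probe_scores, accuracy):.2f}")
```

Because it depends only on ranks, Spearman correlation rewards any monotone relationship between probe score and accuracy, without requiring the probe's raw scores to be calibrated probabilities.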