ReProbe: Efficient Test-Time Scaling of Multi-Step Reasoning by Probing Internal States of Large Language Models

About

LLMs can solve complex tasks by generating long, multi-step reasoning chains. Test-time scaling (TTS) can further improve performance by sampling multiple variants of intermediate reasoning steps, verifying their correctness, and selecting the best steps for continuation. However, existing verification approaches, such as Process Reward Models (PRMs), are computationally expensive and require large-scale human or model-generated annotations. We propose a lightweight alternative for step-level reasoning verification based on probing the internal states of LLMs. We train a transformer-based probe that uses the internal states of a frozen LLM to estimate the credibility of its reasoning steps during generation. Annotation can be provided either by a larger LLM (e.g., DeepSeek-R1) or in a self-supervised manner by the original model itself. The probes are lightweight, containing fewer than 10M parameters. Across multiple domains, including mathematics, planning, and general knowledge question answering, our probes match or exceed the performance of PRMs that are up to 810x larger. These results suggest that LLM internal states encode confidence in their reasoning processes and can serve as reliable signals for step verification, offering a promising path toward scalable, generalizable TTS and more introspective LLMs.
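The step-level test-time scaling loop described above (sample several candidate next steps, score each with a verifier, continue from the best one) can be sketched as a greedy search. This is a minimal illustration and not the authors' implementation: `step_level_search`, `propose_steps`, `score_step`, and `is_final` are hypothetical names, and the toy scorer is only a stand-in for a probe that would read the frozen LLM's internal states.

```python
from typing import Callable, List

def step_level_search(
    question: str,
    propose_steps: Callable[[str, List[str], int], List[str]],
    score_step: Callable[[str, List[str], str], float],
    is_final: Callable[[str], bool],
    n_candidates: int = 4,
    max_steps: int = 8,
) -> List[str]:
    """Greedy step-level selection: keep the highest-scoring candidate at each step."""
    chain: List[str] = []
    for _ in range(max_steps):
        # Sample several variants of the next reasoning step (hypothetical generator).
        candidates = propose_steps(question, chain, n_candidates)
        if not candidates:
            break
        # Score each candidate with the verifier. In the setting above this score
        # would come from a probe over LLM internal states; here it is a stand-in.
        best = max(candidates, key=lambda s: score_step(question, chain, s))
        chain.append(best)
        if is_final(best):
            break
    return chain

# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
def toy_propose(question: str, chain: List[str], n: int) -> List[str]:
    step_no = len(chain) + 1
    return [f"step {step_no} variant {i}" for i in range(n)]

def toy_score(question: str, chain: List[str], step: str) -> float:
    # Pretend variant 2 is always the most credible candidate.
    return 1.0 if step.endswith("variant 2") else 0.1

def toy_is_final(step: str) -> bool:
    return step.startswith("step 3")

chain = step_level_search("toy question", toy_propose, toy_score, toy_is_final)
print(chain)
```

The verifier is passed in as a plain callable, so the same loop would work with a PRM, a probe, or any other step scorer. The abstract does not specify the search strategy, so greedy selection is used here only for brevity.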

Jingwei Ni, Ekaterina Fadeeva, Tianyi Wu, Mubashara Akhtar, Jiaheng Zhang, Elliott Ash, Markus Leippold, Timothy Baldwin, See-Kiong Ng, Artem Shelmanov, Mrinmaya Sachan • 2025

Related benchmarks

Task	Dataset	Result
Science Question Answering	ScienceQA	Accuracy 95.7	916
Mathematical Reasoning	GSM8K (test)	Accuracy 98.8	816
Mathematical Reasoning	GSM8K	Accuracy 100	388
Science Question Answering	ScienceQA (test)	Average Accuracy 96.7	273
Question Answering	StrategyQA (test)	Task Accuracy 96.7	74
Reasoning	MATH	--	46
Mathematical Reasoning	Math ID GSM8k ProofNet	GSM8k Accuracy 97.8	28
Question Answering	QA OOD StrQA SciQA	StrQA Accuracy 88.6	28
Reasoning Question Answering	StrategyQA	Accuracy 0.93	26
Step-level correctness assessment	Trips (test)	PR-AUC 79.9	22

Showing 10 of 34 rows
