
ReProbe: Efficient Test-Time Scaling of Multi-Step Reasoning by Probing Internal States of Large Language Models

About

LLMs can solve complex tasks by generating long, multi-step reasoning chains. Test-time scaling (TTS) can further improve performance by sampling multiple variants of intermediate reasoning steps, verifying their correctness, and selecting the best steps for continuation. However, existing verification approaches, such as Process Reward Models (PRMs), are computationally expensive and require large-scale human or model-generated annotations. We propose a lightweight alternative for step-level reasoning verification based on probing the internal states of LLMs. We train a transformer-based probe that uses the internal states of a frozen LLM to estimate the credibility of its reasoning steps during generation. Annotation can be provided either by a larger LLM (e.g., DeepSeek-R1) or in a self-supervised manner by the original model itself. The probes are lightweight, containing fewer than 10M parameters. Across multiple domains, including mathematics, planning, and general knowledge question answering, our probes match or exceed the performance of PRMs that are up to 810x larger. These results suggest that LLM internal states encode confidence in their reasoning processes and can serve as reliable signals for step verification, offering a promising path toward scalable, generalizable TTS and more introspective LLMs.
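
The step-level selection loop described above (sample several candidate next steps, score each, continue from the best one) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `step_level_best_of_n`, `propose_steps`, `score_step`, and `is_final` are hypothetical names, and the toy scorer is a lookup table standing in for a probe that would read the frozen LLM's internal states.

```python
from typing import Callable, List, Sequence


def step_level_best_of_n(
    propose_steps: Callable[[List[str]], Sequence[str]],
    score_step: Callable[[List[str], str], float],
    is_final: Callable[[str], bool],
    max_steps: int = 8,
) -> List[str]:
    """Greedy step-level selection: keep the highest-scoring candidate at each step."""
    chain: List[str] = []
    for _ in range(max_steps):
        candidates = propose_steps(chain)  # hypothetical: sample N next steps from the LLM
        if not candidates:
            break
        # hypothetical: a verifier (e.g. a probe) assigns each candidate a credibility score
        best = max(candidates, key=lambda step: score_step(chain, step))
        chain.append(best)
        if is_final(best):
            break
    return chain


# Toy stand-ins so the sketch runs without any model (all values are made up).
TOY_CANDIDATES = [
    ["2 + 3 = 5", "2 + 3 = 6"],
    ["5 * 4 = 20", "5 * 4 = 25"],
    ["Answer: 20", "Answer: 25"],
]
TOY_SCORES = {
    "2 + 3 = 5": 0.9, "2 + 3 = 6": 0.2,
    "5 * 4 = 20": 0.8, "5 * 4 = 25": 0.3,
    "Answer: 20": 0.95, "Answer: 25": 0.1,
}


def toy_propose(chain: List[str]) -> Sequence[str]:
    # Return the candidate set for the current depth, or nothing once exhausted.
    return TOY_CANDIDATES[len(chain)] if len(chain) < len(TOY_CANDIDATES) else []


def toy_score(chain: List[str], step: str) -> float:
    # Placeholder for a learned verifier; here just a table lookup.
    return TOY_SCORES[step]


def toy_is_final(step: str) -> bool:
    return step.startswith("Answer:")


if __name__ == "__main__":
    print(step_level_best_of_n(toy_propose, toy_score, toy_is_final))
```

In a real setting, the cost of `score_step` dominates the loop, which is why replacing a large PRM with a small probe matters for test-time scaling.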

Jingwei Ni, Ekaterina Fadeeva, Tianyi Wu, Mubashara Akhtar, Jiaheng Zhang, Elliott Ash, Markus Leippold, Timothy Baldwin, See-Kiong Ng, Artem Shelmanov, Mrinmaya Sachan • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | GSM8K (test) | Accuracy: 98.8 | 816 |
| Science Question Answering | ScienceQA | Accuracy: 95.7 | 791 |
| Mathematical Reasoning | GSM8K | Accuracy: 100 | 388 |
| Science Question Answering | ScienceQA (test) | Average Accuracy: 96.7 | 273 |
| Question Answering | StrategyQA (test) | Task Accuracy: 96.7 | 74 |
| Reasoning | MATH | -- | 46 |
| Mathematical Reasoning | Math ID GSM8k ProofNet | GSM8k Accuracy: 97.8 | 28 |
| Question Answering | QA OOD StrQA SciQA | StrQA Accuracy: 88.6 | 28 |
| Reasoning Question Answering | StrategyQA | Accuracy: 0.93 | 26 |
| Step-level correctness assessment | Trips (test) | PR-AUC: 79.9 | 22 |
Showing 10 of 34 rows
