
CoRefine: Confidence-Guided Self-Refinement for Adaptive Test-Time Compute

About

Large Language Models (LLMs) often rely on test-time scaling via parallel decoding (for example, 512 samples) to boost reasoning accuracy, but this incurs substantial compute. We introduce CoRefine, a confidence-guided self-refinement method that achieves competitive accuracy using a fraction of the tokens via a lightweight 211k-parameter Conv1D controller atop a frozen LLM. The controller consumes full-trace confidence to decide whether to halt, re-examine, or try a different approach, enabling targeted self-correction with an average of 2.7 refinement steps per problem and roughly 190-fold token reduction relative to 512-sample baselines. Across diverse reasoning benchmarks and three open-source models, the controller achieves 92.6 percent precision when it confidently halts, indicating that confidence dynamics reliably signal correctness without ground-truth verification. We extend this to CoRefine-Tree, a hybrid sequential-parallel variant that adaptively balances exploration and exploitation, with easy serving integration and verifier compatibility. By treating confidence as a control signal rather than a correctness guarantee, CoRefine provides a modular primitive for scalable reasoning and agentic settings with imperfect verifiers.
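The control loop described above can be sketched in miniature. The snippet below is an illustrative assumption, not the paper's actual 211k-parameter controller: a hand-set three-tap 1D convolution smooths a per-token confidence trace, and simple thresholds map it to one of the three actions (halt, re-examine, try a different approach). The function names, kernel weights, and thresholds are all hypothetical.

```python
# Hypothetical sketch of a confidence-guided refinement controller.
# The kernel and thresholds are illustrative, not learned parameters.

def smooth(trace, kernel=(0.25, 0.5, 0.25)):
    """1D convolution with edge-replication padding over a confidence trace."""
    k = len(kernel) // 2
    padded = [trace[0]] * k + list(trace) + [trace[-1]] * k
    return [sum(w * padded[i + j] for j, w in enumerate(kernel))
            for i in range(len(trace))]

def decide(trace, halt_thresh=0.9, retry_thresh=0.5):
    """Map a full-trace confidence signal to one of three control actions."""
    s = smooth(trace)
    mean_conf = sum(s) / len(s)
    if mean_conf >= halt_thresh and min(s) >= retry_thresh:
        return "HALT"            # trace looks reliable: stop refining
    if mean_conf < retry_thresh:
        return "NEW_APPROACH"    # globally weak trace: restart differently
    return "RE_EXAMINE"          # localized dips: targeted self-correction

def refine(generate, score, max_steps=8):
    """Sequential loop: generate a trace, score it, act on the decision."""
    action, answer = "NEW_APPROACH", None
    for step in range(max_steps):
        answer = generate(action)          # frozen LLM produces a candidate
        action = decide(score(answer))     # controller reads its confidence
        if action == "HALT":
            return answer, step + 1
    return answer, max_steps
```

Treating the output only as a control signal, as the abstract suggests, means a "HALT" decision ends refinement without any ground-truth verification; the 92.6 percent halt precision is the empirical justification for trusting that signal.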

Chen Jin, Ryutaro Tanno, Tom Diethe, Philip Teare • 2026

Related benchmarks

Task                     Dataset    Metric      Result   Rank
Mathematical Reasoning   AIME 25    Accuracy    87.3     201
Mathematical Reasoning   AIME 24    Accuracy    90.7     130
Mathematical Reasoning   AIME 24    Accuracy    90.7     84
Mathematical Reasoning   HMMT 25    Accuracy    83.3     78
Mathematical Reasoning   BRUMO 25   Accuracy    92.6     37
Mathematical Reasoning   AIME 25    Accuracy    87.3     17
