Scalable Best-of-N Selection for Large Language Models via Self-Certainty

About

Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models (LLMs) through increased test-time computation. Current state-of-the-art methods often employ computationally intensive reward models for response evaluation and selection. Reward-free alternatives, like self-consistency and universal self-consistency, are limited in their ability to handle open-ended generation tasks or scale effectively. To address these limitations, we propose self-certainty, a novel and efficient metric that leverages the inherent probability distribution of LLM outputs to estimate response quality without requiring external reward models. We hypothesize that higher distributional self-certainty, aggregated across multiple samples, correlates with improved response accuracy, as it reflects greater confidence in the generated output. Through extensive experiments on various reasoning tasks, we demonstrate that self-certainty (1) scales effectively with increasing sample size N, akin to reward models but without the computational overhead; (2) complements chain-of-thought, improving reasoning performance beyond greedy decoding; and (3) generalizes to open-ended tasks where traditional self-consistency methods fall short. Our findings establish self-certainty as a practical and efficient way to improve LLM reasoning capabilities. The code is available at https://github.com/backprop07/Self-Certainty

Zhewei Kang, Xuandong Zhao, Dawn Song • 2025
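
The abstract describes scoring each of N sampled responses by the confidence of the model's own token distributions and keeping the highest-scoring one. The sketch below is one possible reading of that idea, not the authors' implementation: it assumes self-certainty is measured as the mean KL divergence from the uniform distribution to each per-token distribution (the abstract does not give the formula), and the names `self_certainty` and `best_of_n` are hypothetical.

```python
import math
from typing import List, Sequence

# A response is represented here as a list of per-token probability
# distributions over the vocabulary (an assumption made for illustration).
Response = Sequence[Sequence[float]]


def self_certainty(token_dists: Response, eps: float = 1e-12) -> float:
    """Mean KL(U || p) over token positions.

    Assumed definition: 0 for uniform distributions, larger when the
    distributions are more peaked (the model is more confident).
    """
    if not token_dists:
        raise ValueError("response has no token distributions")
    total = 0.0
    for dist in token_dists:
        v = len(dist)
        # KL(U || p) = -(1/V) * sum_j log(V * p_j); eps guards against log(0)
        total += -sum(math.log(v * max(p, eps)) for p in dist) / v
    return total / len(token_dists)


def best_of_n(candidates: List[Response]) -> int:
    """Return the index of the candidate with the highest self-certainty."""
    scores = [self_certainty(c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example with a 4-token vocabulary and two one-token "responses".
uncertain = [[0.25, 0.25, 0.25, 0.25]]
confident = [[0.97, 0.01, 0.01, 0.01]]
chosen = best_of_n([uncertain, confident])  # selects the confident response
```

In practice the per-token distributions would come from a softmax over the model's logits at each generated position, so no external reward model is involved in the selection step.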

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K	Accuracy 81.7	499
Mathematical Reasoning	AIME 2025	Accuracy 63	214
Radiology Report Generation	MIMIC-CXR (test)	ROUGE-L 0.1974	209
Mathematical Reasoning	AMC23	PASS@1 Accuracy 60	207
Mathematical Reasoning	GSM8K	--	204
Mathematical Reasoning	HMMT 2025	Accuracy 50	194
Reasoning	GPQA Diamond	Accuracy 51	185
Mathematical Reasoning	AIME24	Accuracy 87.7	160
Math Word Problem Solving	GSM8K	Accuracy 78.5	158
Visual Grounded Reasoning	TreeBench	Overall Score 48.2	153

Showing 10 of 101 rows
