
Scalable Best-of-N Selection for Large Language Models via Self-Certainty

About

Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models (LLMs) through increased test-time computation. Current state-of-the-art methods often rely on computationally intensive reward models to evaluate and select responses. Reward-free alternatives, such as self-consistency and universal self-consistency, are limited in their ability to handle open-ended generation tasks or to scale effectively. To address these limitations, we propose self-certainty, a novel and efficient metric that leverages the inherent probability distribution of LLM outputs to estimate response quality without requiring external reward models. We hypothesize that higher distributional self-certainty, aggregated across multiple samples, correlates with improved response accuracy, as it reflects greater confidence in the generated output. Through extensive experiments on various reasoning tasks, we demonstrate that self-certainty (1) scales effectively with increasing sample size N, akin to reward models but without the computational overhead; (2) complements chain-of-thought, improving reasoning performance beyond greedy decoding; and (3) generalizes to open-ended tasks where traditional self-consistency methods fall short. Our findings establish self-certainty as a practical and efficient way to improve LLM reasoning capabilities. The code is available at https://github.com/backprop07/Self-Certainty
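To make the selection idea concrete, here is a minimal sketch of Best-of-N selection driven by a distribution-based confidence score. The abstract does not state the formula, so the sketch assumes that self-certainty is the mean KL divergence from the uniform distribution to the model's next-token distribution, averaged over the tokens of a response. The function names (`token_certainty`, `self_certainty`, `best_of_n`) and the input layout are hypothetical and are not taken from the authors' repository, which remains the reference implementation.

```python
import math

def token_certainty(probs):
    """KL(U || p) for one next-token distribution (assumed scoring rule).

    Equals 0 when p is uniform and grows as p becomes more peaked.
    """
    v = len(probs)
    # KL(U || p) = sum_j (1/V) * log((1/V) / p_j) = -(1/V) * sum_j log(V * p_j)
    # Probabilities are clamped to avoid log(0).
    return -sum(math.log(v * max(p, 1e-12)) for p in probs) / v

def self_certainty(token_distributions):
    """Average the per-token score over all tokens of one response.

    Assumes a non-empty list of per-token probability vectors.
    """
    scores = [token_certainty(p) for p in token_distributions]
    return sum(scores) / len(scores)

def best_of_n(candidates):
    """Return the text of the candidate with the highest score.

    candidates: non-empty list of (text, token_distributions) pairs
    (hypothetical layout chosen for this illustration).
    """
    return max(candidates, key=lambda c: self_certainty(c[1]))[0]

# Toy example with a 4-token vocabulary and 2 tokens per response.
uniform = [[0.25, 0.25, 0.25, 0.25]] * 2
peaked = [[0.97, 0.01, 0.01, 0.01]] * 2
winner = best_of_n([("hesitant answer", uniform), ("confident answer", peaked)])
```

In practice the per-token distributions would come from the softmax of the model's logits at each generated position, so no reward model is queried. Selecting the single highest-scoring response is only one possible aggregation; a vote over extracted answers weighted by this score is another.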

Zhewei Kang, Xuandong Zhao, Dawn Song • 2025

Related benchmarks

Task                           Dataset             Result                Rank
Mathematical Reasoning         GSM8K               Accuracy 81.7         499
Mathematical Reasoning         AIME 2025           Accuracy 63           214
Radiology Report Generation    MIMIC-CXR (test)    ROUGE-L 0.1974        209
Mathematical Reasoning         AMC23               PASS@1 Accuracy 60    207
Mathematical Reasoning         GSM8K               --                    204
Mathematical Reasoning         HMMT 2025           Accuracy 50           194
Reasoning                      GPQA Diamond        Accuracy 51           185
Mathematical Reasoning         AIME24              Accuracy 87.7         160
Math Word Problem Solving      GSM8K               Accuracy 78.5         158
Visual Grounded Reasoning      TreeBench           Overall Score 48.2    153
