
Scalable Best-of-N Selection for Large Language Models via Self-Certainty

About

Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models (LLMs) through increased test-time computation. Current state-of-the-art methods often employ computationally intensive reward models for response evaluation and selection. Reward-free alternatives, such as self-consistency and universal self-consistency, are limited in their ability to handle open-ended generation tasks or to scale effectively. To address these limitations, we propose self-certainty, a novel and efficient metric that leverages the inherent probability distribution of LLM outputs to estimate response quality without requiring external reward models. We hypothesize that higher distributional self-certainty, aggregated across multiple samples, correlates with improved response accuracy, as it reflects greater confidence in the generated output. Through extensive experiments on various reasoning tasks, we demonstrate that self-certainty (1) scales effectively with increasing sample size N, akin to reward models but without the computational overhead; (2) complements chain-of-thought, improving reasoning performance beyond greedy decoding; and (3) generalizes to open-ended tasks where traditional self-consistency methods fall short. Our findings establish self-certainty as a practical and efficient way to improve LLM reasoning capabilities. The code is available at https://github.com/backprop07/Self-Certainty

Zhewei Kang, Xuandong Zhao, Dawn Song • 2025
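
The selection idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the score used here is an assumption: each response is scored by the average KL divergence from the uniform distribution to its per-token output distributions, so more peaked distributions score higher. The function names (`token_certainty`, `response_certainty`, `best_of_n`) and the toy inputs are hypothetical and are not taken from the authors' repository.

```python
import math
from typing import List, Sequence


def token_certainty(probs: Sequence[float]) -> float:
    """KL(U || p) for one next-token distribution p over V tokens.

    Assumed scoring rule: 0 for a uniform p, larger for a more peaked p.
    """
    v = len(probs)
    eps = 1e-12  # guard against log(0) for zero-probability tokens
    return -sum(math.log(v * max(p, eps)) for p in probs) / v


def response_certainty(step_probs: Sequence[Sequence[float]]) -> float:
    """Average the per-token score over all generated positions."""
    return sum(token_certainty(p) for p in step_probs) / len(step_probs)


def best_of_n(candidates: List[Sequence[Sequence[float]]]) -> int:
    """Return the index of the candidate with the highest score."""
    scores = [response_certainty(c) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


# Toy example: two candidate responses, two tokens each, vocabulary of 4.
flat = [[0.25, 0.25, 0.25, 0.25]] * 2    # model is unsure at every step
peaked = [[0.97, 0.01, 0.01, 0.01]] * 2  # model is confident at every step
chosen = best_of_n([flat, peaked])       # selects the peaked candidate
```

In practice the per-token distributions would come from the softmax over the model's logits at each generated position, so scoring needs no extra model beyond the one that produced the samples.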

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | GSM8K | Accuracy 81.7 | 351 |
| Mathematical Reasoning | AIME24 | Accuracy 87.7 | 130 |
| Mathematical Reasoning | HMMT25 | Accuracy 78 | 78 |
| Mathematical Reasoning | MATH500 | Accuracy 38.3 | 57 |
| Mathematical Reasoning | AIME 24 | Accuracy 91.3 | 35 |
| Scientific Reasoning | GPQA Diamond | -- | 28 |
| Mathematical Reasoning | AIME 25 | Accuracy 87.3 | 12 |
| Science Question Answering | ARC Challenge | Accuracy 66.4 | 10 |
| Science Question Answering | ARC Easy | Accuracy 84.5 | 10 |
| Hard LLM Reasoning | HLE | Accuracy 2.5 | 10 |
Showing 10 of 22 rows
