Soft Best-of-n Sampling for Model Alignment

About

Best-of-$n$ (BoN) sampling is a practical approach for aligning language model outputs with human preferences without expensive fine-tuning. BoN sampling is performed by generating $n$ responses to a prompt and then selecting the sample that maximizes a reward function. BoN yields high reward values in practice at a distortion cost, as measured by the KL divergence between the sampled and original distributions. This distortion is coarsely controlled by varying the number of samples: larger $n$ yields a higher reward at a higher distortion cost. We introduce Soft Best-of-$n$ sampling, a generalization of BoN that allows for smooth interpolation between the original distribution and the reward-maximizing distribution through a temperature parameter $\lambda$. We establish theoretical guarantees showing that Soft Best-of-$n$ sampling converges sharply to the optimal tilted distribution at a rate of $O(1/n)$ in both KL divergence and expected (relative) reward. For sequences of discrete outputs, we analyze an additive reward model that reveals the fundamental limitations of blockwise sampling.
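The selection step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the abstract does not spell out the selection rule, so the sketch assumes that Soft Best-of-$n$ picks one of the $n$ candidates with probability proportional to $\exp(r/\lambda)$, where $r$ is the candidate's reward. The function names `best_of_n` and `soft_best_of_n` and the parameter `lam` are hypothetical, and generating the candidates and scoring them is left to the caller.

```python
import math
import random


def best_of_n(candidates, rewards):
    """Standard BoN: return the candidate with the highest reward."""
    best_index = max(range(len(candidates)), key=lambda i: rewards[i])
    return candidates[best_index]


def soft_best_of_n(candidates, rewards, lam, rng=random):
    """Assumed Soft BoN rule: sample a candidate with probability
    proportional to exp(reward / lam).

    Small lam approaches standard BoN (argmax over the n candidates);
    large lam approaches a uniform pick among the n candidates, which
    are themselves draws from the original distribution.
    """
    if not candidates or len(candidates) != len(rewards):
        raise ValueError("candidates and rewards must be non-empty and equal length")
    if lam <= 0:
        raise ValueError("lam must be positive")
    # Subtract the maximum reward before exponentiating for numerical stability;
    # this leaves the normalized probabilities unchanged.
    top = max(rewards)
    weights = [math.exp((r - top) / lam) for r in rewards]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    responses = ["response A", "response B", "response C"]  # stand-ins for n model samples
    scores = [0.1, 0.9, 0.5]  # stand-ins for reward-model outputs
    print(best_of_n(responses, scores))
    print(soft_best_of_n(responses, scores, lam=0.5, rng=rng))
```

Under this assumed rule, $\lambda$ is the only knob besides $n$: it trades reward against distortion continuously, whereas standard BoN can only change $n$ in integer steps.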

Claudio Mayrink Verdun, Alex Oesterling, Himabindu Lakkaraju, Flavio P. Calmon • 2025

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Mathematical Reasoning	MATH 500	Accuracy (Acc)	84.1	600
Mathematical Reasoning	AIME 2024	Accuracy	26.7	525
Multi-task Language Understanding	MMLU	MMLU Accuracy	81.7	456
Multitask Language Understanding	MMLU	Accuracy	57.6	263
Mathematical Reasoning	OlympiadBench	Accuracy	42.5	213
Grade School Math Reasoning	GSM8K	Accuracy (GSM8K)	95.6	186
General Reasoning	Average (MATH500, OlympiadBench, Minerva, MMLU, GSM8K)	Average Accuracy	70	20
Locomotion Control (Cheetah)	Cheetah 3000 episodes	Return (IQM)	264.8	4
Locomotion Control (Quadruped)	Quadruped (6000 episodes)	Return (IQM)	398.1	4
Locomotion Control (Walker)	Walker (3000 episodes)	Return (IQR)	290.3	4
