Soft Best-of-n Sampling for Model Alignment
About
Best-of-$n$ (BoN) sampling is a practical approach for aligning language model outputs with human preferences without expensive fine-tuning. BoN sampling generates $n$ responses to a prompt and then selects the response that maximizes a reward function. In practice, BoN yields high reward values at the cost of distortion, measured by the KL divergence between the sampled and original distributions. This distortion is only coarsely controlled by varying the number of samples: larger $n$ yields a higher reward at a higher distortion cost. We introduce Soft Best-of-$n$ sampling, a generalization of BoN that allows for smooth interpolation between the original distribution and the reward-maximizing distribution through a temperature parameter $\lambda$. We establish theoretical guarantees showing that Soft Best-of-$n$ sampling converges sharply to the optimal tilted distribution at a rate of $O(1/n)$ in both KL divergence and expected (relative) reward. For sequences of discrete outputs, we analyze an additive reward model that reveals the fundamental limitations of blockwise sampling.
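The selection step described above can be illustrated with a minimal sketch. The abstract does not give the exact selection rule, so this sketch assumes one: among $n$ candidates, candidate $i$ is chosen with probability proportional to $\exp(r(x_i)/\lambda)$. Under that assumption, $\lambda \to 0$ behaves like standard BoN and large $\lambda$ approaches a uniform pick among the candidates, which is a draw from the original distribution. The names `best_of_n`, `soft_best_of_n`, `reward`, and `lam` are hypothetical, and the candidates are assumed to have been sampled from the base model already.

```python
import math
import random
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def best_of_n(candidates: Sequence[T], reward: Callable[[T], float]) -> T:
    """Standard BoN: return the candidate with the highest reward."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return max(candidates, key=reward)


def soft_best_of_n(
    candidates: Sequence[T],
    reward: Callable[[T], float],
    lam: float,
    rng: random.Random,
) -> T:
    """Assumed soft selection rule: pick candidate i with probability
    proportional to exp(reward(x_i) / lam)."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    if lam <= 0:
        raise ValueError("lam must be positive")
    rewards = [reward(c) for c in candidates]
    # Subtract the maximum reward before exponentiating for numerical stability;
    # this leaves the normalized selection probabilities unchanged.
    top = max(rewards)
    weights = [math.exp((r - top) / lam) for r in rewards]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-ins for n = 4 sampled responses, scored by a toy reward (length).
    responses = ["a", "bb", "ccc", "dddd"]
    print(best_of_n(responses, reward=len))
    print(soft_best_of_n(responses, reward=len, lam=0.5, rng=rng))
```

In this sketch $n$ is simply the number of candidates passed in, so $n$ and `lam` can be varied independently: $n$ sets how many responses are drawn, while `lam` sets how strongly the selection favors high-reward responses.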
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH 500 | Accuracy (Acc) | 84.1 | 543 |
| Multi-task Language Understanding | MMLU | MMLU Accuracy | 81.7 | 442 |
| Multitask Language Understanding | MMLU | Accuracy | 57.6 | 263 |
| Mathematical Reasoning | OlympiadBench | Accuracy | 42.5 | 213 |
| Grade School Math Reasoning | GSM8K | Accuracy (GSM8K) | 95.6 | 138 |
| General Reasoning | Average (MATH500, OlympiadBench, Minerva, MMLU, GSM8K) | Average Accuracy | 70 | 20 |
| Locomotion Control (Cheetah) | Cheetah (3000 episodes) | Return (IQM) | 264.8 | 4 |
| Locomotion Control (Quadruped) | Quadruped (6000 episodes) | Return (IQM) | 398.1 | 4 |
| Locomotion Control (Walker) | Walker (3000 episodes) | Return (IQR) | 290.3 | 4 |