What should post-training optimize? A test-time scaling law perspective
About
Large language models are increasingly deployed with test-time strategies: sample $N$ responses, score them with a reward model or verifier, and return the best. This deployment rule exposes a mismatch in post-training: standard objectives optimize the mean reward of a single response, whereas best-of-$N$ performance is governed by the upper tail of the reward distribution. Recent test-time-aware objectives partly address this mismatch, but typically assume that training can use the same per-prompt rollout budget as deployment, which is impractical when post-training must cover many prompts while deployment can allocate much larger per-prompt test-time compute. We study this budget-mismatch regime, where only $m\ll N$ per-prompt rollouts are available during training but the target objective is best-of-$N$ deployment. Under structural assumptions on the reward tails, we show that the policy gradient of the best-of-$N$ objective can be approximated from a much smaller rollout group by extrapolating upper-tail statistics. This yields a family of Tail-Extrapolated estimators for best-of-$N$-oriented post-training: a simple direct estimator, Tail-Extrapolated Advantage (TEA), and a fixed-order debiased Prefix-TEA estimator based on moment cancellation. Experiments on instruction-following tasks show that TEA and Prefix-TEA improve best-of-$N$ performance across different language models, reward models and datasets under various training and test-time budget settings.
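The mismatch between mean reward and best-of-$N$ performance can be illustrated with a small simulation. The sketch below is not the paper's TEA or Prefix-TEA estimator. It only assumes two hypothetical policies whose rewards are Gaussian with the same mean but different spread, and estimates the expected best-of-$N$ reward for each by Monte Carlo. The names `best_of_n`, `expected_best_of_n`, `narrow`, and `wide` are illustrative and do not come from the source.

```python
import random
import statistics


def best_of_n(rewards):
    """Return the index of the highest-scoring response in a list of rewards."""
    return max(range(len(rewards)), key=rewards.__getitem__)


def expected_best_of_n(sample_reward, n, trials=2000, seed=0):
    """Monte Carlo estimate of E[max of n i.i.d. rewards].

    sample_reward is a callable taking a random.Random instance and
    returning one reward draw (a stand-in for "sample a response and
    score it with a reward model").
    """
    rng = random.Random(seed)
    return statistics.fmean(
        max(sample_reward(rng) for _ in range(n)) for _ in range(trials)
    )


# Hypothetical reward distributions: identical mean, different upper tails.
def narrow(rng):
    return rng.gauss(0.0, 0.5)


def wide(rng):
    return rng.gauss(0.0, 2.0)


if __name__ == "__main__":
    for n in (1, 4, 64):
        print(
            f"N={n:>2}  narrow={expected_best_of_n(narrow, n):.2f}  "
            f"wide={expected_best_of_n(wide, n):.2f}"
        )
```

Under these assumptions the two policies are indistinguishable at $N=1$, but the wider distribution pulls ahead as $N$ grows, which is why an objective that only tracks single-response mean reward can misrank policies intended for best-of-$N$ deployment.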
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Best-of-N Reward Evaluation | UltraFeedback (core250) | Reward Score | 24.323 | 18 |
| Helpful Dialogue | Anthropic HH-RLHF helpful core250 (test) | Reward Score | 18.93 | 18 |
| Reward Modeling | UltraFeedback core250 (held-out evaluation) | Delta (Δ) | 3.543 | 18 |
| Instruction Following | UltraFeedback (core250) | Delta Preference Score (bo64) | 12.568 | 15 |
| Pairwise Judge Comparison | UltraFeedback (core250) | Win Count (W) | 161 | 14 |
| Preference Evaluation | UltraFeedback core250 (test) | Win Rate | 80 | 12 |
| Reward Modeling | HH-RLHF helpful core250 (held-out evaluation) | Reward Score | 20.155 | 12 |
| Reward Modeling | Anthropic/hh-rlhf HH-helpful core250 | Delta RM | 0.292 | 6 |
| Reward Modeling | UltraFeedback core500 (held-out) | bo1 Score | 0.467 | 4 |
| Reward Modeling | UltraFeedback core250 (test) | Reward Score Difference (TEA vs GRPO) | 1.103 | 4 |