What should post-training optimize? A test-time scaling law perspective

About

Large language models are increasingly deployed with test-time strategies: sample $N$ responses, score them with a reward model or verifier, and return the best. This deployment rule exposes a mismatch in post-training: standard objectives optimize the mean reward of a single response, whereas best-of-$N$ performance is governed by the upper tail of the reward distribution. Recent test-time-aware objectives partly address this mismatch, but typically assume that training can use the same per-prompt rollout budget as deployment, which is impractical when post-training must cover many prompts while deployment can allocate much larger per-prompt test-time compute. We study this budget-mismatch regime, where only $m\ll N$ per-prompt rollouts are available during training but the target objective is best-of-$N$ deployment. Under structural assumptions on the reward tails, we show that the policy gradient of the best-of-$N$ objective can be approximated from a much smaller rollout group by extrapolating upper-tail statistics. This yields a family of Tail-Extrapolated estimators for best-of-$N$-oriented post-training: a simple direct estimator, Tail-Extrapolated Advantage (TEA), and a fixed-order debiased Prefix-TEA estimator based on moment cancellation. Experiments on instruction-following tasks show that TEA and Prefix-TEA improve best-of-$N$ performance across different language models, reward models and datasets under various training and test-time budget settings.
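The mismatch between mean reward and best-of-$N$ reward can be illustrated with a small simulation. The sketch below is not the paper's method: it only compares two hypothetical Gaussian reward distributions (the names `narrow` and `wide`, the Gaussian shape, and all constants are assumptions made for illustration) that share the same mean but differ in spread, and estimates their best-of-16 reward by Monte Carlo.

```python
import random
import statistics


def best_of_n(sample_reward, n, trials, rng):
    """Monte Carlo estimate of the expected maximum of n i.i.d. rewards."""
    return statistics.fmean(
        max(sample_reward(rng) for _ in range(n)) for _ in range(trials)
    )


rng = random.Random(0)  # fixed seed so the run is reproducible

# Two hypothetical per-prompt reward distributions (illustrative assumption):
# identical mean reward, different upper tails.
def narrow(r):
    return r.gauss(0.0, 0.5)


def wide(r):
    return r.gauss(0.0, 2.0)


# Single-response objective: mean reward (about the same for both).
mean_narrow = statistics.fmean(narrow(rng) for _ in range(20000))
mean_wide = statistics.fmean(wide(rng) for _ in range(20000))

# Deployment objective: expected best-of-16 reward (clearly different).
bo16_narrow = best_of_n(narrow, 16, 2000, rng)
bo16_wide = best_of_n(wide, 16, 2000, rng)

print(f"mean reward:  narrow={mean_narrow:.3f}  wide={mean_wide:.3f}")
print(f"best-of-16:   narrow={bo16_narrow:.3f}  wide={bo16_wide:.3f}")
```

Under these assumptions the two policies look nearly identical to a mean-reward objective, yet the wider distribution yields a much higher best-of-16 reward, which is the sense in which best-of-$N$ performance depends on the upper tail rather than the mean.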

Muheng Li, Jian Qian, Wenlong Mou • 2026

Related benchmarks

Task	Dataset	Result
Best-of-N Reward Evaluation	UltraFeedback (core250)	Reward Score: 24.323	18
Helpful Dialogue	Anthropic HH-RLHF helpful core250 (test)	Reward Score: 18.93	18
Reward Modeling	UltraFeedback core250 (held-out evaluation)	Delta (Δ): 3.543	18
Instruction Following	UltraFeedback (core250)	Delta Preference Score (bo64): 12.568	15
Pairwise Judge Comparison	UltraFeedback (core250)	Win Count (W): 161	14
Preference Evaluation	UltraFeedback core250 (test)	Win Rate: 80	12
Reward Modeling	HH-RLHF helpful core250 (held-out evaluation)	Reward Score: 20.155	12
Reward Modeling	Anthropic/hh-rlhf HH-helpful core250	Delta RM: 0.292	6
Reward Modeling	UltraFeedback core500 (held-out)	bo1 Score: 0.467	4
Reward Modeling	UltraFeedback core250 (test)	Reward Score Difference (TEA vs GRPO): 1.103	4

Showing 10 of 12 rows
