The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimisation

About

Applying Reinforcement Learning with Verifiable Rewards (RLVR) to mathematical and coding domains has significantly improved the reasoning and problem-solving abilities of Large Language Models. Despite its success in single-generation problem solving, reinforcement learning fine-tuning may harm the model's exploration ability. This shows up as reduced diversity of generations and, as a result, degraded performance under Best-of-N sampling for large values of N. In this work, we focus on optimizing the max@k metric, a continuous generalization of pass@k. We derive an unbiased on-policy gradient estimate for direct optimization of this metric. Furthermore, we extend our derivations to off-policy updates, a common element of modern RLVR algorithms that allows better sample efficiency. Empirically, we show that our objective effectively optimizes the max@k metric in off-policy scenarios, aligning the model with the Best-of-N inference strategy.
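To make the metric concrete, here is a minimal sketch of how max@k can be estimated from n sampled generations. It assumes the standard combinatorial estimator (the expected maximum reward over a uniformly random size-k subset of the n samples), which may differ in detail from the formulation used in the paper. The function names `max_at_k` and `pass_at_k` are illustrative, not taken from the authors' code.

```python
from math import comb


def max_at_k(rewards, k):
    """Estimate max@k from n sampled rewards (assumed estimator, see lead-in).

    Returns the average, over all size-k subsets of the n samples,
    of the largest reward in the subset.
    """
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    ordered = sorted(rewards)  # ascending: ordered[i - 1] is the i-th smallest
    # The i-th smallest reward is the maximum of exactly C(i-1, k-1) subsets:
    # it is included, and the other k-1 members come from the i-1 smaller ones.
    weighted = sum(
        comb(i - 1, k - 1) * r
        for i, r in enumerate(ordered, start=1)
        if i >= k
    )
    return weighted / comb(n, k)


def pass_at_k(num_samples, num_correct, k):
    """Standard pass@k estimate for binary rewards, used here for comparison."""
    return 1.0 - comb(num_samples - num_correct, k) / comb(num_samples, k)


if __name__ == "__main__":
    binary = [0, 0, 1, 1]
    # With 0/1 rewards the two estimates coincide.
    print(max_at_k(binary, 2), pass_at_k(4, 2, 2))
    # With continuous rewards, max@k still gives a graded score.
    print(max_at_k([0.2, 0.9, 0.5], 2))
```

With binary rewards this estimate coincides with pass@k, which is the sense in which max@k generalizes it. For k = 1 it reduces to the mean reward, and for k = n to the maximum reward in the sample.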

Farid Bagirov, Mikhail Arkhipov, Ksenia Sycheva, Evgeniy Glukhov, Egor Bogomolov • 2025

Related benchmarks

Task	Dataset	Result
Instruction Following	UltraFeedback (core250)	Delta Preference Score (bo64): 11.304	15
Function Calling	ToolRL 80-prompt (held-out)	Best@3: 94	8
Maze Navigation	Maze 100 held-out mazes	Best Success Rate @ 3: 52.6	8
Multi-hop Question Answering	MuSiQue 300-question hop-stratified (held-out)	Best@3: 75.7	8
Chain-of-Thought Reasoning	EUREQA (held-out half of hard_5)	Best@3: 20.6	8
