
The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimisation

About

The application of Reinforcement Learning with Verifiable Rewards (RLVR) to mathematical and coding domains has demonstrated significant improvements in the reasoning and problem-solving abilities of Large Language Models. Despite its success in single-generation problem solving, the reinforcement learning fine-tuning process may harm the model's exploration ability, as reflected in decreased diversity of generations and a resulting degradation of performance during Best-of-N sampling for large N values. In this work, we focus on optimizing the max@k metric, a continuous generalization of pass@k. We derive an unbiased on-policy gradient estimate for direct optimization of this metric. Furthermore, we extend our derivations to off-policy updates, a common element in modern RLVR algorithms, which allow better sample efficiency. Empirically, we show that our objective effectively optimizes the max@k metric in off-policy scenarios, aligning the model with the Best-of-N inference strategy.
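To make the metric concrete, here is a minimal sketch of how max@k could be estimated from n sampled generations, assuming max@k denotes the expected maximum reward among k generations drawn for a prompt. It averages the maximum over all size-k subsets of the n observed rewards, using a closed-form count instead of enumeration. The function name `max_at_k` and its signature are illustrative assumptions; this is not the authors' code, and it covers only the metric, not the gradient estimator derived in the paper.

```python
from math import comb


def max_at_k(rewards, k):
    """Estimate max@k from n sampled rewards (hypothetical helper).

    Averages the maximum reward over all size-k subsets of the samples.
    After sorting in ascending order, the i-th smallest reward (1-indexed)
    is the maximum of exactly C(i-1, k-1) of those subsets.
    """
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(rewards)")
    ordered = sorted(rewards)
    # math.comb returns 0 when i - 1 < k - 1, so small ranks contribute nothing.
    weighted = sum(comb(i - 1, k - 1) * r for i, r in enumerate(ordered, start=1))
    return weighted / comb(n, k)


# With binary rewards the estimate coincides with the usual pass@k estimator,
# 1 - C(n - c, k) / C(n, k), where c is the number of correct samples.
binary = [0, 0, 1, 1]
pass_at_2 = 1 - comb(2, 2) / comb(4, 2)
print(max_at_k(binary, 2), pass_at_2)
```

With k = 1 the estimate is the mean reward, and with k = n it is the best observed reward, which is the sense in which the metric interpolates between single-generation quality and Best-of-N performance.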

Farid Bagirov, Mikhail Arkhipov, Ksenia Sycheva, Evgeniy Glukhov, Egor Bogomolov • 2025

Related benchmarks

Task                          Dataset                                           Result                                  Rank
Instruction Following         UltraFeedback (core250)                           Delta Preference Score (bo64): 11.304   15
Function Calling              ToolRL 80-prompt (held-out)                       Best@3: 94                              8
Maze Navigation               Maze 100 held-out mazes                           Best Success Rate @ 3: 52.6             8
Multi-hop Question Answering  MuSiQue 300-question hop-stratified (held-out)    Best@3: 75.7                            8
Chain-of-Thought Reasoning    EUREQA (held-out half of hard_5)                  Best@3: 20.6                            8