
Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models

About

Reinforcement learning with verifiable rewards (RLVR), which typically adopts Pass@1 as the reward, struggles to balance exploration and exploitation: policies come to prefer conservative actions and converge to a local optimum. Identifying an appropriate reward metric is therefore crucial. Although prior work has used Pass@k in evaluation, its connection to LLM exploration ability in RLVR remains largely overlooked. To investigate this, we first use Pass@k as the reward to train the policy model (i.e., $\textbf{Pass@k Training}$) and observe an improvement in its exploration ability. Next, we derive an analytical solution for the advantage of Pass@k Training, leading to an efficient and effective process. Building on this, our analysis reveals that exploration and exploitation are not inherently conflicting objectives; rather, they can mutually enhance each other. Moreover, Pass@k Training with analytical derivation essentially involves directly designing the advantage function. Inspired by this, we preliminarily explore advantage design for RLVR, showing promising results and highlighting a potential future direction.
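
For readers unfamiliar with the metric, here is a minimal sketch of the standard unbiased Pass@k estimator commonly used in evaluation: given n sampled responses to a problem, of which c are verified correct, it estimates the probability that at least one of k randomly chosen responses is correct. This is an illustration of the metric only, not the paper's training procedure or its analytical advantage; the function name `pass_at_k` and its parameters are assumptions made for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n sampled responses, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a random subset of k responses contains no correct response.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when fewer than k incorrect responses exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical numbers: 8 rollouts for one problem, 2 verified correct.
    for k in (1, 2, 4):
        print(f"Pass@{k} = {pass_at_k(8, 2, k):.3f}")
```

With k = 1 the estimate reduces to c / n, the per-problem accuracy behind Pass@1. Larger k gives credit to a problem as soon as any of the k samples succeeds, which is why the metric is associated with exploration.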

Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, Guang Shi • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Mathematical Reasoning | In-Distribution Reasoning Performance Suite (AIME, AMC, MATH-500, Minerva, Olympiad) | AIME 2024 Score | 20.9 | 97 |
| Mathematical Reasoning | AIME 2025 | Avg@10 Accuracy | 31 | 15 |
| Mathematical Reasoning | OlympiadBench | Avg@5 Accuracy | 55.06 | 15 |
| Mathematical Reasoning | AIME 2024 | Avg@10 Accuracy | 27.67 | 15 |
| Mathematical Reasoning | AMC 2023 | Avg@10 Accuracy | 74 | 15 |
| Mathematical Reasoning | MATH-500 (val) | Avg@5 Accuracy | 86.33 | 15 |
| Formal Theorem Proving | Lean (test) | Pass@1 | 54.8 | 14 |
| Mathematical Reasoning | Out-of-Domain Mathematical Reasoning Suite (ARC-c, GPQA, MMLU-Pro) | ARC-c Score | 79.3 | 10 |
| Mathematical Reasoning | AIME24 | Average Rank | 2.73 | 5 |
| Mathematical Reasoning | AIME 25 | Average Rank | 2.7 | 5 |
Showing 10 of 17 rows
