On-Policy RL with Optimal Reward Baseline

About

Reinforcement learning algorithms are fundamental to aligning large language models with human preferences and to enhancing their reasoning capabilities. However, current reinforcement learning algorithms often suffer from training instability due to loose on-policy constraints and from computational inefficiency due to auxiliary models. In this work, we propose On-Policy RL with Optimal reward baseline (OPO), a novel and simplified reinforcement learning algorithm designed to address these challenges. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Moreover, OPO integrates a practically feasible formulation of the optimal reward baseline that minimizes gradient variance. We evaluate OPO on mathematical reasoning benchmarks. The results demonstrate its superior performance and training stability without additional models or regularization terms. Furthermore, OPO achieves lower policy shifts and higher output entropy, encouraging more diverse and less repetitive responses. These results highlight OPO as a promising direction for stable and effective reinforcement learning in large language model alignment and reasoning tasks. The implementation is merged into the verl library at https://verl.readthedocs.io/en/latest/algo/opo.html.
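
The abstract does not spell out the baseline formula, so the following is only a minimal sketch under a stated assumption: that the "practically feasible" optimal baseline can be approximated by the length-weighted mean reward over the responses sampled for one prompt, with each response's advantage being its reward minus that baseline. The function names `length_weighted_baseline` and `opo_style_advantages` are hypothetical and are not taken from the verl API.

```python
from typing import List, Sequence


def length_weighted_baseline(rewards: Sequence[float], lengths: Sequence[int]) -> float:
    """Length-weighted mean reward over one prompt's sampled responses.

    Assumption (not stated in the abstract): response length stands in
    for the squared gradient norm in the variance-minimizing baseline.
    """
    if len(rewards) != len(lengths) or len(rewards) == 0:
        raise ValueError("rewards and lengths must be non-empty and equally long")
    total_length = sum(lengths)
    if total_length <= 0:
        raise ValueError("total response length must be positive")
    return sum(r * n for r, n in zip(rewards, lengths)) / total_length


def opo_style_advantages(rewards: Sequence[float], lengths: Sequence[int]) -> List[float]:
    """Advantage of each response: its reward minus the shared baseline."""
    baseline = length_weighted_baseline(rewards, lengths)
    return [r - baseline for r in rewards]


# Hypothetical group of four responses to one prompt (binary rewards).
rewards = [1.0, 0.0, 1.0, 0.0]
lengths = [100, 300, 200, 400]  # response lengths in tokens

baseline = length_weighted_baseline(rewards, lengths)
advantages = opo_style_advantages(rewards, lengths)
```

In this toy group the long responses are the incorrect ones, so the baseline sits below the plain mean reward of 0.5. Under the same assumption, the advantages weighted by length sum to zero.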

Yaru Hao, Li Dong, Xun Wu, Shaohan Huang, Zewen Chi, Furu Wei • 2025

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MATH 500	Top-1 Accuracy 82.2	384
Mathematical Reasoning	Minerva	Pass@1 Accuracy 31.6	289
Mathematical Reasoning	Minerva Math	Accuracy 38.6	233
Mathematical Reasoning	OlympiadBench	Accuracy 41	213
Mathematical Reasoning	AMC23	Pass@1 Accuracy 71.5	207
Mathematical Reasoning	AIME 25	Pass@1 Accuracy 8.4	178
Mathematical Reasoning	Minerva	Accuracy (Acc) 26.1	146
Mathematical Reasoning	Olympiad	Accuracy 0.403	134
Mathematical Reasoning	Minerva Math	Pass@1 Accuracy 38.2	104
Mathematical Reasoning	Minerva	Avg@16 28.5	42

Showing 10 of 28 rows
