The Optimal Reward Baseline for Gradient-Based Reinforcement Learning
About
There exist a number of reinforcement learning algorithms which learn by climbing the gradient of expected reward. Their long-run convergence has been proved, even in partially observable environments with non-deterministic actions, and without the need for a system model. However, the variance of the gradient estimator has been found to be a significant practical problem. Recent approaches have discounted future rewards, introducing a bias-variance trade-off into the gradient estimate. We incorporate a reward baseline into the learning system, and show that it affects variance without introducing further bias. In particular, as we approach the zero-bias, high-variance parameterization, the optimal (or variance-minimizing) constant reward baseline is equal to the long-term average expected reward. Modified policy-gradient algorithms are presented, and a number of experiments demonstrate their improvement over previous work.
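The core idea can be sketched in code. The following is a minimal illustration (not the paper's exact algorithm): a REINFORCE-style gradient ascent on a stateless two-armed bandit, where the constant baseline is an incremental estimate of the long-term average reward, as the abstract suggests for the zero-bias parameterization. All names (`reinforce_bandit`, the learning rate, the reward means) are illustrative choices, not taken from the paper.

```python
import math
import random


def softmax(prefs):
    # Numerically stable softmax over action preferences.
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    s = sum(exps)
    return [e / s for e in exps]


def reinforce_bandit(true_means, steps=5000, lr=0.1, seed=0):
    """Policy-gradient learning on a stateless bandit with a baseline.

    The update is theta_i += lr * (r - b) * d/dtheta_i log pi(a),
    where the baseline b is a running estimate of the long-term
    average reward. Subtracting b leaves the gradient estimate
    unbiased (E[grad log pi] = 0) while reducing its variance.
    """
    rng = random.Random(seed)
    prefs = [0.0] * len(true_means)
    baseline = 0.0
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        # Sample an action from the softmax policy.
        a = rng.choices(range(len(prefs)), weights=probs)[0]
        r = rng.gauss(true_means[a], 1.0)
        # Incremental estimate of the average reward (the baseline).
        baseline += (r - baseline) / t
        # grad log pi(a): (1 - pi(a)) for the chosen arm, -pi(i) otherwise.
        for i in range(len(prefs)):
            grad = (1.0 if i == a else 0.0) - probs[i]
            prefs[i] += lr * (r - baseline) * grad
    return softmax(prefs)


probs = reinforce_bandit([1.0, 0.0])
```

After training, the policy concentrates probability on the higher-reward arm; replacing `r - baseline` with `r` alone gives the same expected update but noisier individual steps, which is the variance effect the paper analyzes.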
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | AIME 24 | Avg@32 accuracy | 33.13 | 23 |
| Multi-Turn Tool-Integrated Reasoning (TIR) | AIME 25 | Peak avg@32 score | 21.15 | 6 |
| Multi-Turn Tool-Integrated Reasoning (TIR) | AIME 24 | Peak avg@32 score | 30.63 | 6 |
| Multi-Turn Tool-Integrated Reasoning (TIR) | AMC 23 | Peak avg@32 score | 62.5 | 6 |
| Multi-Turn Tool-Integrated Reasoning (TIR) | MATH500 | Peak avg@32 score | 75.69 | 6 |
| Single-Turn Mathematical Reasoning | AIME 25 | Peak avg@32 score | 30.1 | 5 |
| Single-Turn Mathematical Reasoning | MATH500 | Peak avg score | 90.75 | 5 |
| Single-Turn Mathematical Reasoning | AMC 23 | Peak avg@32 score | 80.55 | 5 |