
The Optimal Reward Baseline for Gradient-Based Reinforcement Learning

About

There exist a number of reinforcement learning algorithms which learn by climbing the gradient of expected reward. Their long-run convergence has been proved, even in partially observable environments with non-deterministic actions, and without the need for a system model. However, the variance of the gradient estimator has been found to be a significant practical problem. Recent approaches have discounted future rewards, introducing a bias-variance trade-off into the gradient estimate. We incorporate a reward baseline into the learning system, and show that it affects variance without introducing further bias. In particular, as we approach the zero-bias, high-variance parameterization, the optimal (or variance minimizing) constant reward baseline is equal to the long-term average expected reward. Modified policy-gradient algorithms are presented, and a number of experiments demonstrate their improvement over previous work.
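The variance-reduction effect the abstract describes can be illustrated with a small sketch (not code from the paper): a two-armed bandit with a softmax policy, comparing the empirical variance of the score-function gradient estimate, grad log pi(a) * (r - b), when the baseline b is zero versus when b equals the average expected reward. The bandit, its reward values, and all function names are illustrative assumptions.

```python
import random
import math


def softmax(prefs):
    # Numerically stable softmax over action preferences.
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    s = sum(exps)
    return [e / s for e in exps]


def grad_log_pi(probs, action, i):
    # For a softmax policy: d/d theta_i of log pi(action) = 1{i == action} - pi(i).
    return (1.0 if i == action else 0.0) - probs[i]


def gradient_estimate_variance(baseline, n_samples=20000, seed=0):
    # Empirical variance of the single-sample gradient estimate for theta_0,
    # on a two-armed bandit with deterministic per-arm rewards (an assumption).
    rng = random.Random(seed)
    rewards = [1.0, 0.0]
    probs = softmax([0.0, 0.0])  # uniform policy
    samples = []
    for _ in range(n_samples):
        a = 0 if rng.random() < probs[0] else 1
        r = rewards[a]
        samples.append(grad_log_pi(probs, a, 0) * (r - baseline))
    mean = sum(samples) / n_samples
    return sum((g - mean) ** 2 for g in samples) / n_samples


# Under the uniform policy, the average expected reward is 0.5.
avg_reward = 0.5
var_no_baseline = gradient_estimate_variance(0.0)
var_avg_baseline = gradient_estimate_variance(avg_reward)
```

In this toy case every sampled gradient term equals 0.25 once the average-reward baseline is subtracted, so the estimator's variance collapses to zero while its mean is unchanged, matching the abstract's claim that the baseline affects variance without adding bias.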

Lex Weaver, Nigel Tao • 2013

Related benchmarks

Task                                        Dataset   Result                    Rank
Mathematical Reasoning                      AIME 24   Avg@32 accuracy: 33.13    23
Multi-Turn Tool-Integrated Reasoning (TIR)  AIME 25   Peak avg@32 score: 21.15  6
Multi-Turn Tool-Integrated Reasoning (TIR)  AIME 24   Peak avg@32 score: 30.63  6
Multi-Turn Tool-Integrated Reasoning (TIR)  AMC 23    Peak avg@32 score: 62.5   6
Multi-Turn Tool-Integrated Reasoning (TIR)  MATH500   Peak avg@32 score: 75.69  6
Single-Turn Mathematical Reasoning          AIME 25   Peak avg@32 score: 30.1   5
Single-Turn Mathematical Reasoning          MATH500   Peak avg score: 90.75     5
Single-Turn Mathematical Reasoning          AMC 23    Peak avg@32 score: 80.55  5
