AdaGamma: State-Dependent Discounting for Temporal Adaptation in Reinforcement Learning
About
The discount factor in reinforcement learning controls both the effective planning horizon and the strength of bootstrapping, yet most deep RL methods use a single fixed value across all states. While state-dependent discounting is conceptually appealing, naive deep actor-critic implementations can become unstable and degenerate toward TD-error collapse. We propose AdaGamma, a practical deep actor-critic method for state-dependent discounting that learns a state-dependent discount function together with a return-consistency objective to regularize the induced backup structure. On the theory side, we analyze the Bellman operator induced by state-dependent discounting and establish its basic well-posedness properties under suitable conditions. Empirically, AdaGamma integrates into both SAC and PPO, yielding consistent improvements on continuous-control benchmarks, and achieves statistically significant gains in an online A/B test on the JD Logistics platform. These results suggest that state-dependent discounting can be made effective in deep RL when coupled with a return-consistency objective that prevents degenerate target manipulation.
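To make the idea of a state-dependent Bellman backup concrete, the following is a minimal tabular sketch, not the AdaGamma algorithm itself. It assumes the backup V(s) = r(s) + gamma(s) * sum over s' of P(s, s') V(s'), with the discount indexed by the current state; the paper may use a different indexing convention. The function name `evaluate_policy`, the three-state chain, and all numeric values are hypothetical and chosen only for illustration. Here the discounts are fixed by hand rather than learned, so the return-consistency objective is not represented.

```python
def evaluate_policy(P, r, gamma, tol=1e-10, max_iters=10_000):
    """Iterate a Bellman backup with a per-state discount gamma[s].

    Assumed form (illustrative, not taken from the paper):
        V(s) = r[s] + gamma[s] * sum_t P[s][t] * V(t)
    When every gamma[s] < 1, the backup is a sup-norm contraction
    with modulus max(gamma), so the iteration converges.
    """
    n = len(r)
    v = [0.0] * n
    for _ in range(max_iters):
        new_v = [
            r[s] + gamma[s] * sum(P[s][t] * v[t] for t in range(n))
            for s in range(n)
        ]
        # Stop once the largest per-state change is below the tolerance.
        if max(abs(a - b) for a, b in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v


# Hypothetical 3-state chain: 0 -> 1 -> 2, and state 2 loops on itself.
P = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
]
r = [0.0, 0.0, 1.0]  # reward is collected only in the looping state

gamma_fixed = [0.9, 0.9, 0.9]    # a single discount shared by all states
gamma_state = [0.99, 0.99, 0.5]  # long horizon on the path, short in the loop

v_fixed = evaluate_policy(P, r, gamma_fixed)
v_state = evaluate_policy(P, r, gamma_state)
print("fixed gamma:          ", [round(x, 4) for x in v_fixed])
print("state-dependent gamma:", [round(x, 4) for x in v_state])
```

In this toy chain, lowering the discount only in the looping state shortens the effective horizon there while the path states still propagate value almost undiscounted, which is the kind of per-state trade-off between horizon and bootstrapping strength that a fixed discount cannot express.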
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Reinforcement Learning | MountainCarContinuous v0 | Average Agent Reward | 94.6 | 65 |
| Reinforcement Learning | Acrobot v1 | Mean Return | -82.61 | 42 |
| Reinforcement Learning | Ant v4 | Average Return | 3.77e+3 | 18 |
| Reinforcement Learning | CartPole v1 | Return | 500 | 16 |
| Reinforcement Learning | Humanoid v4 | Reward | 457 | 9 |
| High-Dimensional Control | SafetyPointGoal1 v0 (test) | Reward | 28.25 | 8 |
| High-Dimensional Locomotion | Humanoid v4 (test) | Reward | 6.91e+3 | 8 |
| High-Dimensional Locomotion | Ant v4 (test) | Reward | 4.13e+3 | 8 |
| Safety Reinforcement Learning | SafetyPointGoal1 v0 | Reward | 27.45 | 8 |
| Reinforcement Learning | Pendulum v1 | Reward | -58.557 | 4 |