AdaGamma: State-Dependent Discounting for Temporal Adaptation in Reinforcement Learning

About

The discount factor in reinforcement learning controls both the effective planning horizon and the strength of bootstrapping, yet most deep RL methods use a single fixed value across all states. While state-dependent discounting is conceptually appealing, naive deep actor-critic implementations can become unstable and degenerate toward TD-error collapse. We propose AdaGamma, a practical deep actor-critic method for state-dependent discounting that learns a state-dependent discount function together with a return-consistency objective to regularize the induced backup structure. On the theory side, we analyze the Bellman operator induced by state-dependent discounting and establish its basic well-posedness properties under suitable conditions. Empirically, AdaGamma integrates into both SAC and PPO, yielding consistent improvements on continuous-control benchmarks, and achieves statistically significant gains in an online A/B test on the JD Logistics platform. These results suggest that state-dependent discounting can be made effective in deep RL when coupled with a return-consistency objective that prevents degenerate target manipulation.
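To make the idea of state-dependent discounting concrete, here is a minimal sketch of how a per-state discount changes the return recursion and the one-step bootstrap target. It is not the authors' implementation: the function names `discounted_returns` and `td_target` are hypothetical, the discounts are passed in as plain numbers rather than produced by a learned network, and the convention that the discount is evaluated at the state where the transition starts is an assumption. The return-consistency objective mentioned above is not modeled.

```python
def discounted_returns(rewards, gammas):
    """Backward recursion G_t = r_t + gamma_t * G_{t+1}, with G_T = 0.

    Assumption: gammas[t] is the discount attached to the state visited
    at step t. A fixed discount is the special case where all entries
    of gammas are equal.
    """
    if len(rewards) != len(gammas):
        raise ValueError("rewards and gammas must have the same length")
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gammas[t] * running
        returns[t] = running
    return returns


def td_target(reward, gamma_s, next_value, done=False):
    """One-step bootstrap target r + gamma(s) * V(s').

    A smaller gamma_s shortens the effective horizon at this state and
    weakens the dependence on the bootstrapped estimate next_value.
    """
    if done:
        return reward
    return reward + gamma_s * next_value


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    print(discounted_returns(rewards, [0.5, 0.5, 0.5]))  # fixed discount
    print(discounted_returns(rewards, [0.0, 0.5, 0.9]))  # per-state discount
```

In the second call the first state has a discount of zero, so its return reduces to the immediate reward regardless of what follows. This illustrates how the discount controls both horizon and bootstrapping strength on a per-state basis.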

Yaomin Wang, Jianting Pan, Ran Tian, Xiaoyang Li, Yu Zhang, Hengle Qin, Tianshu YU • 2026

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Reinforcement Learning	MountainCarContinuous v0	Average Agent Reward	94.6	65
Reinforcement Learning	Acrobot v1	Mean Return	-82.61	42
Reinforcement Learning	Ant v4	Average Return	3.77e+3	26
Reinforcement Learning	CartPole v1	Return	500	16
Reinforcement Learning	Humanoid v4	Reward	457	13
High-Dimensional Control	SafetyPointGoal1 v0 (test)	Reward	28.25	8
High-Dimensional Locomotion	Humanoid v4 (test)	Reward	6.91e+3	8
High-Dimensional Locomotion	Ant v4 (test)	Reward	4.13e+3	8
Safety Reinforcement Learning	SafetyPointGoal1 v0	Reward	27.45	8
Reinforcement Learning	Pendulum v1	Reward	-58.557	4
