Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

Proximal Gradient Temporal Difference Learning: Stable Reinforcement Learning with Polynomial Sample Complexity

About

In this paper, we introduce proximal gradient temporal difference learning, which provides a principled way of designing and analyzing true stochastic gradient temporal difference learning algorithms. We show how gradient TD (GTD) reinforcement learning methods can be formally derived, not by starting from their original objective functions, as previously attempted, but rather from a primal-dual saddle-point objective function. We also conduct a saddle-point error analysis to obtain finite-sample bounds on their performance. Previous analyses of this class of algorithms use stochastic approximation techniques to prove asymptotic convergence, and do not provide any finite-sample analysis. We also propose an accelerated algorithm, called GTD2-MP, that uses proximal ``mirror maps'' to yield an improved convergence rate. The results of our theoretical analysis imply that the GTD family of algorithms are comparable and may indeed be preferred over existing least squares TD methods for off-policy learning, due to their linear complexity. We provide experimental results showing the improved performance of our accelerated gradient TD methods.

Bo Liu, Ian Gemp, Mohammad Ghavamzadeh, Ji Liu, Sridhar Mahadevan, Marek Petrik• 2020

Related benchmarks

TaskDatasetResultRank
Off-policy predictionBoyan chain
Tail-average RMSE0.2747
16
Off-policy predictionBaird's counterexample
Steady-state AUC Error1.933
15
Off-policy predictionRandom Walk
Steady-state AUC Error0.1013
15
Off-policy predictionTwo-state environment
Steady-state AUC Error3.67
9
Off-policy predictionTwo-state
RMSVE6.66
9
Off-policy predictionRandom Walk
RMSVE0.0605
9
Off-policy predictionBoyan Chain environment
Steady-state AUC Error0.9017
9
Showing 7 of 7 rows

Other info

Follow for update