An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning

About

In this paper we introduce the idea of improving the performance of parametric temporal-difference (TD) learning algorithms by selectively emphasizing or de-emphasizing their updates on different time steps. In particular, we show that varying the emphasis of linear TD($\lambda$)'s updates in a particular way causes its expected update to become stable under off-policy training. The only prior model-free TD methods to achieve this with per-step computation linear in the number of function approximation parameters are the gradient-TD family of methods including TDC, GTD($\lambda$), and GQ($\lambda$). Compared to these methods, our _emphatic TD($\lambda$)_ is simpler and easier to use; it has only one learned parameter vector and one step-size parameter. Our treatment includes general state-dependent discounting and bootstrapping functions, and a way of specifying varying degrees of interest in accurately valuing different states.
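The abstract does not state the update equations, so the following is only a minimal sketch of how a linear emphatic TD($\lambda$) learner could look. It assumes the commonly cited form of the algorithm: a scalar follow-on trace `F`, an emphasis `M`, and an eligibility trace `e` scaled by that emphasis. The class name `EmphaticTD`, the helper `dot`, and all argument names are hypothetical and chosen for illustration; `rho` stands for the importance-sampling ratio between target and behavior policies, and `gamma`, `lam`, and `interest` are the state-dependent discounting, bootstrapping, and interest values mentioned above.

```python
"""Illustrative sketch of linear emphatic TD(lambda); not the authors' reference code."""


def dot(a, b):
    """Inner product of two equal-length sequences of floats."""
    return sum(x * y for x, y in zip(a, b))


class EmphaticTD:
    """Linear emphatic TD(lambda) with one weight vector and one step size.

    Assumed update, for the transition from time t to t+1:
        F_t     = rho_{t-1} * gamma_t * F_{t-1} + I_t
        M_t     = lam_t * I_t + (1 - lam_t) * F_t
        e_t     = rho_t * (gamma_t * lam_t * e_{t-1} + M_t * phi_t)
        delta_t = R_{t+1} + gamma_{t+1} * theta.phi_{t+1} - theta.phi_t
        theta   = theta + alpha * delta_t * e_t
    """

    def __init__(self, n_features, alpha):
        self.alpha = alpha                  # the single step-size parameter
        self.theta = [0.0] * n_features     # the single learned weight vector
        self.e = [0.0] * n_features         # eligibility trace
        self.F = 0.0                        # follow-on trace
        self.rho_prev = 0.0                 # no transition precedes the first step

    def step(self, phi, reward, phi_next, gamma, gamma_next,
             rho=1.0, lam=0.0, interest=1.0):
        """Process one transition and return the TD error."""
        # Follow-on trace: discounted, importance-weighted accumulated interest.
        self.F = self.rho_prev * gamma * self.F + interest
        # Emphasis for the current state.
        M = lam * interest + (1.0 - lam) * self.F
        # Eligibility trace, with the current features weighted by the emphasis.
        self.e = [rho * (gamma * lam * e_i + M * phi_i)
                  for e_i, phi_i in zip(self.e, phi)]
        # TD error under the current weights.
        delta = (reward + gamma_next * dot(self.theta, phi_next)
                 - dot(self.theta, phi))
        # Weight update; cost is linear in the number of features.
        self.theta = [th + self.alpha * delta * e_i
                      for th, e_i in zip(self.theta, self.e)]
        self.rho_prev = rho
        return delta


# Usage: two on-policy transitions (rho = 1) with a single constant feature.
agent = EmphaticTD(n_features=1, alpha=0.1)
for _ in range(2):
    agent.step(phi=[1.0], reward=1.0, phi_next=[1.0], gamma=0.5, gamma_next=0.5)
print(agent.theta, agent.F)
```

The sketch keeps only one learned vector (`theta`) and one step size (`alpha`), which matches the simplicity the abstract contrasts with gradient-TD methods; the extra state is two scalars (`F`, `rho_prev`) and the trace `e`.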

Richard S. Sutton, A. Rupam Mahmood, Martha White • 2015

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Off-policy prediction | RW tabular | Tail-average RMSE: 0.04 | 16 |
| Off-policy prediction | Boyan chain | Tail-average RMSE: 0.172 | 16 |
| Linear off-policy prediction | Baird environment | Max RMSE: 2.21 | 8 |
| Linear off-policy prediction | New two-state environment | Max RMSE: 3.89 | 8 |
| Linear off-policy prediction | Two-state environment | Max RMSE: 1.72 | 8 |
| Off-policy prediction | RW inverted | Tail-average RMSE: 0.047 | 8 |
| Off-policy prediction | Two-state | Tail-average RMSE: 1.14 | 7 |
| Off-policy prediction | New two-state | Tail-average RMSE: 8.573 | 7 |
