An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning
About
In this paper we introduce the idea of improving the performance of parametric temporal-difference (TD) learning algorithms by selectively emphasizing or de-emphasizing their updates on different time steps. In particular, we show that varying the emphasis of linear TD($\lambda$)'s updates in a particular way causes its expected update to become stable under off-policy training. The only prior model-free TD methods to achieve this with per-step computation linear in the number of function approximation parameters are the gradient-TD family of methods including TDC, GTD($\lambda$), and GQ($\lambda$). Compared to these methods, our _emphatic TD($\lambda$)_ is simpler and easier to use; it has only one learned parameter vector and one step-size parameter. Our treatment includes general state-dependent discounting and bootstrapping functions, and a way of specifying varying degrees of interest in accurately valuing different states.
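The abstract does not spell out the update itself. As a rough illustration, the following is a minimal sketch of one step of linear emphatic TD($\lambda$), based on our reading of the algorithm as it is commonly stated: a scalar followon trace `F`, an emphasis `M`, an eligibility trace `e`, and a single weight vector `theta` with a single step size `alpha`. The function name, argument names, and the plain-list representation are our own assumptions for illustration, not the authors' code.

```python
def emphatic_td_step(theta, e, F, x, x_next, reward, *,
                     rho_prev, rho, gamma, gamma_next, lam, interest, alpha):
    """One step of linear emphatic TD(lambda) (illustrative sketch).

    theta, e  : weight vector and eligibility trace (lists of floats)
    F         : followon trace from the previous step (use 0.0 at the start)
    x, x_next : feature vectors of the current and next state
    rho_prev  : importance-sampling ratio of the previous step
    rho       : importance-sampling ratio of the current step
    gamma     : discount at the current state; gamma_next at the next state
    lam       : bootstrapping parameter at the current state
    interest  : interest in valuing the current state accurately
    """
    # Followon trace: discounted, importance-weighted accumulation of interest.
    F = rho_prev * gamma * F + interest
    # Emphasis: mixes immediate interest with the followon trace.
    M = lam * interest + (1.0 - lam) * F
    # Eligibility trace, with the current features scaled by the emphasis.
    e = [rho * (gamma * lam * e_i + M * x_i) for e_i, x_i in zip(e, x)]
    # Ordinary TD error under linear function approximation.
    v = sum(t_i * x_i for t_i, x_i in zip(theta, x))
    v_next = sum(t_i * x_i for t_i, x_i in zip(theta, x_next))
    delta = reward + gamma_next * v_next - v
    # Single learned parameter vector, single step size.
    theta = [t_i + alpha * delta * e_i for t_i, e_i in zip(theta, e)]
    return theta, e, F


# Hypothetical usage: one on-policy step (rho = 1) with no bootstrapping.
theta, e, F = emphatic_td_step(
    [0.0, 0.0], [0.0, 0.0], 0.0,
    x=[1.0, 0.0], x_next=[0.0, 1.0], reward=1.0,
    rho_prev=1.0, rho=1.0, gamma=0.0, gamma_next=0.0,
    lam=0.0, interest=1.0, alpha=0.5,
)
```

With `interest` fixed at 1 and all importance-sampling ratios equal to 1, the only difference from conventional linear TD($\lambda$) in this sketch is the factor `M` applied to the features when they enter the trace.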
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Off-policy prediction | RW tabular | Tail-average RMSE | 0.04 | 16 |
| Off-policy prediction | Boyan chain | Tail-average RMSE | 0.172 | 16 |
| Linear off-policy prediction | Baird environment | Max RMSE | 2.21 | 8 |
| Linear off-policy prediction | New two-state environment | Max RMSE | 3.89 | 8 |
| Linear off-policy prediction | Two-state environment | Max RMSE | 1.72 | 8 |
| Off-policy prediction | RW inverted | Tail-average RMSE | 0.047 | 8 |
| Off-policy prediction | Two-state | Tail-average RMSE | 1.14 | 7 |
| Off-policy prediction | New two-state | Tail-average RMSE | 8.573 | 7 |