Intentionally-underestimated Value Function at Terminal State for Temporal-difference Learning with Mis-designed Reward
About
Robot control using reinforcement learning has become popular, but for safety and time-saving reasons its learning process is generally terminated partway through an episode. This study addresses a problem in the most common exception handling that temporal-difference (TD) learning performs at such termination: by forcibly assuming zero value after termination, an unintended implicit underestimation or overestimation occurs, depending on the reward design in the normal states. When an episode is terminated due to task failure, this unintended overestimation can make the failure appear highly valued, and a wrong policy may be acquired. Although the problem can be avoided through careful reward design, reviewing the exception handling at termination is essential for the practical use of TD learning. This paper therefore proposes a method that intentionally underestimates the value after termination to avoid learning failures caused by the unintended overestimation. In addition, the degree of underestimation is adjusted according to the degree of stationarity at termination, preventing excessive exploration caused by the intentional underestimation. Simulations and real-robot experiments showed that the proposed method stably obtains optimal policies for various tasks and reward designs. Video: https://youtu.be/AxXr8uFOe7M
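As a rough illustration of the idea, the minimal sketch below contrasts the conventional zero-value bootstrap at termination with an intentionally pessimistic terminal value. It is not the paper's exact update rule: `kappa` and the `stationarity` scaling are hypothetical stand-ins for the paper's actual formulation.

```python
def td_target(reward, next_value, done, gamma=0.99,
              kappa=1.0, stationarity=0.0):
    """TD(0) target with an intentionally pessimistic terminal value.

    Conventional exception handling bootstraps from V = 0 at
    termination; when rewards are negative, that zero can overestimate
    a failure state. This sketch instead bootstraps from a value pushed
    below zero by `kappa`, relaxing the pessimism as the terminal state
    becomes more stationary (`stationarity` in [0, 1]) so that the
    underestimation does not drive excessive exploration.
    """
    if done:
        v_terminal = -kappa * (1.0 - stationarity)  # <= 0 by construction
        return reward + gamma * v_terminal
    # Non-terminal transition: standard one-step bootstrap.
    return reward + gamma * next_value


# Terminal failure with negative rewards: the conventional zero value
# (kappa=0) yields a higher target than the pessimistic one.
print(td_target(-1.0, 0.0, done=True, kappa=0.0))   # -1.0 (conventional)
print(td_target(-1.0, 0.0, done=True, kappa=10.0))  # -10.9 (proposed idea)
```

Lowering the terminal target below zero keeps failure states from outranking ordinary states under negative reward designs, while the stationarity scaling removes the pessimism where the zero-value assumption is already accurate.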
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Continuous Control | MuJoCo Reacher | Average Reward | -6.1 | 18 |
| CartPole (sparse) | dm_control (test) | Mean Return | 724.9 | 6 |
| Continuous Control (Negative Reward) | Pendulum Mujoco | Mean Return | 8.13e+3 | 6 |
| Continuous Control (Negative Reward) | Pendulum Pybullet | Mean Return | 9.12e+3 | 6 |
| Continuous Control (Negative Reward) | Reacher Mujoco | Mean Return | -6.3 | 6 |
| Continuous Control (Negative Reward) | Reacher Pybullet | Mean Return | 16.8 | 6 |
| Continuous Control (Positive Reward) | Pendulum Mujoco | Return | 9.36e+3 | 6 |
| Continuous Control (Positive Reward) | Pendulum Pybullet | Return | 9.04e+3 | 6 |
| Continuous Control (Positive Reward) | Reacher Pybullet | Mean Return | 18.7 | 6 |
| Cheetah (complex) | dm_control (test) | Mean Return | 692.6 | 6 |