Grid-Mapping Pseudo-Count Constraint for Offline Reinforcement Learning
About
Offline reinforcement learning learns from a static dataset without interacting with the environment, which ensures safety and gives it strong application prospects. However, directly applying standard reinforcement learning algorithms usually fails in the offline setting because out-of-distribution (OOD) state-actions cause inaccurate Q-value approximation. Penalizing the Q-values of OOD state-actions is an effective way to address this problem. Among such penalization methods, count-based approaches have achieved good results in discrete domains with a simple form. Inspired by this, a novel pseudo-count method for continuous domains, called the Grid-Mapping Pseudo-Count method (GPC), is proposed by extending the count-based approach from discrete to continuous domains. First, the continuous state and action spaces are mapped to discrete spaces using Grid-Mapping, and the Q-values of OOD state-actions are then constrained through pseudo-counts. Second, a theoretical proof shows that GPC obtains appropriate uncertainty constraints under fewer assumptions than other pseudo-count methods. Third, GPC is combined with the Soft Actor-Critic algorithm (SAC) to obtain a new algorithm, GPC-SAC. Lastly, experiments on D4RL datasets show that GPC-SAC achieves better performance and lower computational cost than other algorithms that constrain the Q-value.
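The grid-mapping idea described above can be sketched in a few lines: discretize each dimension of the continuous state-action vector into a fixed number of grid cells, count dataset visits per cell, and derive an uncertainty penalty that shrinks with the pseudo-count. This is a minimal illustrative sketch, not the authors' reference implementation; the class name, the number of bins, and the `1/sqrt(n + 1)` penalty shape are assumptions made for illustration.

```python
import numpy as np

class GridPseudoCount:
    """Illustrative grid-mapping pseudo-count (hypothetical API).

    Continuous state-action vectors are mapped to discrete grid cells;
    per-cell visit counts yield an uncertainty penalty that is large for
    rarely visited (near-OOD) cells and small for well-covered ones.
    """

    def __init__(self, low, high, bins=10):
        self.low = np.asarray(low, dtype=float)
        self.high = np.asarray(high, dtype=float)
        self.bins = bins          # cells per dimension (assumed hyperparameter)
        self.counts = {}          # sparse map: grid cell -> visit count

    def _cell(self, x):
        # Normalize each dimension to [0, 1) and discretize into `bins` cells.
        ratio = (np.asarray(x, dtype=float) - self.low) / (self.high - self.low)
        idx = np.clip((ratio * self.bins).astype(int), 0, self.bins - 1)
        return tuple(idx)

    def update(self, x):
        # Called once per (state, action) pair in the offline dataset.
        cell = self._cell(x)
        self.counts[cell] = self.counts.get(cell, 0) + 1

    def penalty(self, x):
        # Pseudo-count-based uncertainty; an unseen cell gets the maximum 1.0.
        n = self.counts.get(self._cell(x), 0)
        return 1.0 / np.sqrt(n + 1)
```

In a SAC-style critic update, such a penalty could be subtracted from the bootstrapped target, e.g. `target = r + gamma * (q_next - beta * penalty(sa_next))`, with `beta` a weighting coefficient; the exact form GPC-SAC uses is given in the paper.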
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | hopper medium | Normalized Score | 82.9 | 52 |
| Offline Reinforcement Learning | walker2d medium | Normalized Score | 87.6 | 51 |
| Offline Reinforcement Learning | walker2d medium-replay | Normalized Score | 86.2 | 50 |
| Offline Reinforcement Learning | hopper medium-replay | Normalized Score | 97.5 | 44 |
| Offline Reinforcement Learning | halfcheetah medium | Normalized Score | 60.8 | 43 |
| Offline Reinforcement Learning | halfcheetah medium-replay | Normalized Score | 55.7 | 43 |
| Offline Reinforcement Learning | Maze2D umaze | Normalized Return | 141 | 38 |
| Offline Reinforcement Learning | Maze2D medium | Normalized Return | 103.7 | 38 |
| Offline Reinforcement Learning | Walker2d medium-expert | Normalized Score | 111.7 | 31 |
| Offline Reinforcement Learning | Hopper medium-expert | Normalized Score | 111.6 | 24 |