Grid-Mapping Pseudo-Count Constraint for Offline Reinforcement Learning
About
Offline reinforcement learning learns from a static dataset without interacting with the environment, which ensures safety and gives it strong application prospects. However, directly applying standard reinforcement learning algorithms usually fails in the offline setting because out-of-distribution (OOD) state-actions cause inaccurate Q-value approximation. Penalizing the Q-values of OOD state-actions is an effective way to address this problem. Among such penalization methods, count-based approaches have achieved good results in discrete domains with a simple form. Inspired by this, a novel pseudo-count method for continuous domains, called the Grid-Mapping Pseudo-Count method (GPC), is proposed by extending the count-based approach from discrete to continuous domains. First, the continuous state and action spaces are mapped to discrete spaces using Grid-Mapping, and the Q-values of OOD state-actions are then constrained through pseudo-counts. Second, a theoretical proof shows that GPC obtains appropriate uncertainty constraints under fewer assumptions than other pseudo-count methods. Third, GPC is combined with the Soft Actor-Critic algorithm (SAC) to obtain a new algorithm, GPC-SAC. Lastly, experiments on D4RL datasets show that GPC-SAC achieves better performance and lower computational cost than other algorithms that constrain the Q-value.
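The grid-mapping idea described above can be sketched in a few lines: discretize each dimension of the continuous state-action vector into a fixed number of grid cells, count dataset visits per cell, and derive an uncertainty penalty that shrinks with the pseudo-count. This is a minimal illustrative sketch, not the authors' reference implementation; the class name, the number of bins, and the `1/sqrt(n + 1)` penalty shape are assumptions made for illustration.

```python
import numpy as np

class GridPseudoCount:
    """Illustrative grid-mapping pseudo-count (hypothetical API).

    Continuous state-action vectors are mapped to discrete grid cells;
    per-cell visit counts yield an uncertainty penalty that is large for
    rarely visited (near-OOD) cells and small for well-covered ones.
    """

    def __init__(self, low, high, bins=10):
        self.low = np.asarray(low, dtype=float)
        self.high = np.asarray(high, dtype=float)
        self.bins = bins          # cells per dimension (assumed hyperparameter)
        self.counts = {}          # sparse map: grid cell -> visit count

    def _cell(self, x):
        # Normalize each dimension to [0, 1) and discretize into `bins` cells.
        ratio = (np.asarray(x, dtype=float) - self.low) / (self.high - self.low)
        idx = np.clip((ratio * self.bins).astype(int), 0, self.bins - 1)
        return tuple(idx)

    def update(self, x):
        # Called once per (state, action) pair in the offline dataset.
        cell = self._cell(x)
        self.counts[cell] = self.counts.get(cell, 0) + 1

    def penalty(self, x):
        # Pseudo-count-based uncertainty; an unseen cell gets the maximum 1.0.
        n = self.counts.get(self._cell(x), 0)
        return 1.0 / np.sqrt(n + 1)
```

In a SAC-style critic update, such a penalty could be subtracted from the bootstrapped target, e.g. `target = r + gamma * (q_next - beta * penalty(sa_next))`, with `beta` a weighting coefficient; the exact form GPC-SAC uses is given in the paper.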
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | hopper medium | Normalized Score | 82.9 | 52 |
| Offline Reinforcement Learning | walker2d medium | Normalized Score | 87.6 | 51 |
| Offline Reinforcement Learning | walker2d medium-replay | Normalized Score | 86.2 | 50 |
| Offline Reinforcement Learning | hopper medium-replay | Normalized Score | 97.5 | 44 |
| Offline Reinforcement Learning | halfcheetah medium | Normalized Score | 60.8 | 43 |
| Offline Reinforcement Learning | halfcheetah medium-replay | Normalized Score | 55.7 | 43 |
| Offline Reinforcement Learning | Maze2D umaze | Normalized Return | 141 | 38 |
| Offline Reinforcement Learning | Maze2D medium | Normalized Return | 103.7 | 38 |
| Offline Reinforcement Learning | Walker2d medium-expert | Normalized Score | 111.7 | 31 |
| Offline Reinforcement Learning | Hopper medium-expert | Normalized Score | 111.6 | 24 |