
Supported Policy Optimization for Offline Reinforcement Learning

About

Policy constraint methods for offline reinforcement learning (RL) typically employ parameterization or regularization that constrains the policy to perform actions within the support set of the behavior policy. The elaborate designs of parameterization methods usually intrude into the policy networks, which may bring extra inference cost and cannot take full advantage of well-established online methods. Regularization methods reduce the divergence between the learned policy and the behavior policy, which may mismatch the inherent density-based definition of the support set and thereby fail to avoid out-of-distribution actions effectively. This paper presents Supported Policy OpTimization (SPOT), which is directly derived from the theoretical formalization of the density-based support constraint. SPOT adopts a VAE-based density estimator to explicitly model the support set of the behavior policy and presents a simple but effective density-based regularization term, which can be plugged non-intrusively into off-the-shelf off-policy RL algorithms. SPOT achieves state-of-the-art performance on standard benchmarks for offline RL. Benefiting from the pluggable design, offline pretrained models from SPOT can also be applied to perform online fine-tuning seamlessly.
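The core idea above — estimate the behavior policy's density and penalize actions whose estimated log-density violates the support constraint — can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: a diagonal Gaussian fit to the dataset actions stands in for the conditional VAE density estimator, and the hinged penalty below is one plausible Lagrangian-style relaxation of the constraint log π_β(a|s) ≥ ε; the exact loss in the paper may differ.

```python
import numpy as np

class GaussianDensity:
    """Stand-in for SPOT's VAE density estimator: fits a diagonal
    Gaussian to the behavior actions and returns log-densities."""

    def fit(self, actions):
        self.mu = actions.mean(axis=0)
        self.var = actions.var(axis=0) + 1e-6  # avoid division by zero
        return self

    def log_prob(self, a):
        # log N(a; mu, diag(var)), summed over action dimensions
        return -0.5 * np.sum(
            np.log(2 * np.pi * self.var) + (a - self.mu) ** 2 / self.var,
            axis=-1,
        )

def spot_actor_objective(q_value, log_density, lam=1.0, eps=-5.0):
    """SPOT-style pluggable objective: maximize the critic's Q-value
    while penalizing actions whose estimated behavior log-density
    falls below the support threshold eps. lam trades off return
    against staying in-support. (Hypothetical form for illustration.)"""
    penalty = np.maximum(eps - log_density, 0.0)  # hinge on the constraint
    return q_value - lam * penalty

# Toy usage: an in-support action is penalized less than an OOD one.
rng = np.random.default_rng(0)
behavior_actions = rng.normal(0.0, 1.0, size=(1000, 2))
density = GaussianDensity().fit(behavior_actions)

in_support = np.zeros(2)        # near the behavior mean
out_of_dist = np.full(2, 8.0)   # far outside the data

obj_in = spot_actor_objective(1.0, density.log_prob(in_support))
obj_ood = spot_actor_objective(1.0, density.log_prob(out_of_dist))
```

Because the penalty term only touches the actor objective, it can be added to an existing off-policy algorithm (e.g. TD3) without modifying the policy network itself, which is what makes the design "non-intrusive".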

Jialong Wu, Haixu Wu, Zihan Qiu, Jianmin Wang, Mingsheng Long • 2022

Related benchmarks

Task | Dataset | Result | Rank
Offline Reinforcement Learning | D4RL Walker2d Medium v2 | Normalized Return: 86.4 | 67
Offline Reinforcement Learning | D4RL halfcheetah v2 (medium-replay) | Normalized Score: 52.2 | 58
Offline Reinforcement Learning | D4RL Hopper-medium-replay v2 | Normalized Return: 100.2 | 54
Offline Reinforcement Learning | D4RL Hopper-medium-expert v2 | Normalized Return: 99.3 | 49
Offline Reinforcement Learning | D4RL walker2d-medium-expert v2 | Normalized Score: 112 | 44
Offline Reinforcement Learning | D4RL HalfCheetah Medium v2 | Average Normalized Return: 58.4 | 43
Offline Reinforcement Learning | D4RL Hopper Medium v2 | Normalized Return: 86 | 43
Offline Reinforcement Learning | D4RL AntMaze | AntMaze Umaze Return: 93.5 | 39
Offline Reinforcement Learning | D4RL walker2d medium-replay v2 | Normalized Score: 91.6 | 36
Offline Reinforcement Learning | D4RL MuJoCo Hopper medium standard | Normalized Score: 86 | 36
Showing 10 of 29 rows

Other info

Code
