Constraints Penalized Q-learning for Safe Offline Reinforcement Learning

About

We study the problem of safe offline reinforcement learning (RL), where the goal is to learn a policy that maximizes long-term reward while satisfying safety constraints given only offline data, without further interaction with the environment. This problem is more appealing for real-world RL applications, in which data collection is costly or dangerous. Enforcing constraint satisfaction is non-trivial, especially in offline settings, as there is a potentially large discrepancy between the policy distribution and the data distribution, causing errors in estimating the value of safety constraints. We show that naive approaches that combine techniques from safe RL and offline RL can only learn sub-optimal solutions. We thus develop a simple yet effective algorithm, Constraints Penalized Q-Learning (CPQ), to solve the problem. Our method admits the use of data generated by mixed behavior policies. We present a theoretical analysis and demonstrate empirically that our approach can learn robustly across a variety of benchmark control tasks, outperforming several baselines.
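
As a hedged illustration, the objective described in the abstract is commonly written as a constrained Markov decision process. The symbols below (reward r, cost c, discount factor \gamma, cost limit l) are assumed notation for this sketch and are not taken from this page:

```latex
% Standard constrained-MDP objective (assumed notation, illustrative only)
\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big]
\quad \text{s.t.} \quad
\mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\Big] \le l
```

In the offline setting both expectations have to be estimated from a fixed dataset of logged transitions rather than from fresh rollouts of \pi, which is where the mismatch between the policy distribution and the data distribution becomes a source of estimation error.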

Haoran Xu, Xianyuan Zhan, Xiangyu Zhu • 2021

Related benchmarks

Task	Dataset	Result
Safe Reinforcement Learning	Safety-Gymnasium AntVelocity OSRL	Cost 0.00e+0	48
Safe Reinforcement Learning	Safety-Gymnasium PointPush (OSRL)	Cost 0.00e+0	48
Safe Reinforcement Learning	Safety-Gymnasium CarCircle OSRL	Cost 0.00e+0	48
Safe Reinforcement Learning	Safety-Gymnasium PointGoal OSRL	Cost 4.9	48
PointButton1	Safety Gymnasium	Normalized Reward 69	21
PointButton2	Safety Gymnasium	Normalized Reward 58	21
PointGoal1	Safety Gymnasium	Normalized Reward 0.74	21
PointGoal2	Safety Gymnasium	Normalized Reward 67	21
PointPush1	Safety Gymnasium	Normalized Reward 33	21
PointPush2	Safety Gymnasium	Normalized Reward 23	21

Showing 10 of 80 rows
