
Penalized Proximal Policy Optimization for Safe Reinforcement Learning

About

Safe reinforcement learning aims to learn the optimal policy while satisfying safety constraints, which is essential in real-world applications. However, current algorithms still struggle to achieve efficient policy updates with hard constraint satisfaction. In this paper, we propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem. Specifically, P3O utilizes a simple-yet-effective penalty function to eliminate the cost constraints and removes the trust-region constraint via the clipped surrogate objective. We theoretically prove the exactness of the proposed method with a finite penalty factor and provide a worst-case analysis of the approximation error when evaluated on sample trajectories. Moreover, we extend P3O to the more challenging multi-constraint and multi-agent scenarios, which are less studied in previous work. Extensive experiments show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotion tasks.
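The abstract describes replacing the constrained policy iteration with a single unconstrained minimization: a PPO-style clipped surrogate for the reward, plus an exact penalty on the clipped cost surrogate. The following is a minimal NumPy sketch of such a penalized objective; the function name, arguments, and the `kappa` penalty factor are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def p3o_loss(ratio, adv_r, adv_c, cost_excess, clip_eps=0.2, kappa=20.0):
    """Illustrative penalized clipped surrogate (assumed form).

    ratio:       pi_new(a|s) / pi_old(a|s) for the sampled actions
    adv_r/adv_c: reward and cost advantage estimates
    cost_excess: current expected cost minus the constraint limit d
    """
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps)
    # Clipped surrogate for reward (as in PPO), to be maximized
    surr_r = np.minimum(ratio * adv_r, clipped * adv_r).mean()
    # Pessimistically clipped surrogate for cost, to be kept below the limit
    surr_c = np.maximum(ratio * adv_c, clipped * adv_c).mean()
    # ReLU penalty with finite factor kappa replaces the hard cost constraint
    penalty = kappa * max(0.0, surr_c + cost_excess)
    # Single unconstrained objective to minimize with a gradient step
    return -surr_r + penalty
```

When the constraint is strictly satisfied (`surr_c + cost_excess <= 0`), the penalty vanishes and the update reduces to plain PPO on the reward; otherwise the penalty term pushes the policy back toward the feasible region.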

Linrui Zhang, Li Shen, Long Yang, Shixiang Chen, Bo Yuan, Xueqian Wang, Dacheng Tao • 2022

Related benchmarks

Task | Dataset | Metric | Result | Rank
--- | --- | --- | --- | ---
Swimmer Velocity | Safety Gymnasium level-2 | Safe Reward | 94 | 12
Car Push | Safety Gymnasium level-2 | Safe Reward | 0.29 | 12
Point Push | Safety Gymnasium level-2 | Safe Reward | 0.23 | 12
Car Goal | Safety Gymnasium level-2 | Safe Reward | 0.36 | 12
Point Button | Safety Gymnasium level-2 | Safe Reward | -0.06 | 12
Point Goal | Safety Gymnasium level-2 | Safe Reward | -0.024 | 12
Car Circle | Safety Gymnasium level-2 | Safe Reward | 7.1 | 12
Hopper Velocity | Safety Gymnasium level-2 | Safe Reward | 0.00e+0 | 12
Constrained Reinforcement Learning | Humanoid | Episodic Reward | 1.67e+3 | 8
Constrained Reinforcement Learning | AntCircle | Episodic Reward | 182.6 | 8

Showing 10 of 20 rows
