# Projection-Based Constrained Policy Optimization

## About
We consider the problem of learning control policies that optimize a reward function while satisfying constraints due to considerations of safety, fairness, or other costs. We propose a new algorithm, Projection-Based Constrained Policy Optimization (PCPO). This is an iterative method for optimizing policies in a two-step process: the first step performs a local reward improvement update, while the second step reconciles any constraint violation by projecting the policy back onto the constraint set. We theoretically analyze PCPO and provide a lower bound on reward improvement, and an upper bound on constraint violation, for each policy update. We further characterize the convergence of PCPO based on two different metrics: $\ell_2$ norm and Kullback-Leibler divergence. Our empirical results over several control tasks demonstrate that PCPO achieves superior performance, averaging more than 3.5 times less constraint violation and around 15% higher reward compared to state-of-the-art methods.
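The two-step update above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it uses a plain gradient-ascent reward step (the paper uses a trust-region step) and projects onto a linearized cost constraint under the $\ell_2$ metric; the function name and signature are hypothetical.

```python
import numpy as np

def pcpo_step_l2(theta, reward_grad, a, b, lr=0.1):
    """One illustrative PCPO-style iteration with an L2-norm projection.

    theta       : current policy parameters
    reward_grad : gradient of the reward objective at theta
    a, b        : linearized cost constraint  a @ (theta_new - theta) + b <= 0
    """
    # Step 1: local reward-improvement update (gradient ascent here;
    # PCPO itself uses a trust-region / natural-gradient step).
    theta_half = theta + lr * reward_grad

    # Step 2: if the intermediate policy violates the (linearized)
    # constraint, project it back onto the constraint set -- the closest
    # point in L2 distance that satisfies a @ (theta_new - theta) + b <= 0.
    violation = a @ (theta_half - theta) + b
    if violation > 0:
        theta_half = theta_half - (violation / (a @ a)) * a
    return theta_half
```

When the reward step already satisfies the constraint, the projection is a no-op; otherwise the update moves only as far along the constraint normal as needed, which is what bounds the per-update constraint violation.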
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Car Circle | Safety Gymnasium level-2 | Safe Reward | 11 | 12 |
| Hopper Velocity | Safety Gymnasium level-2 | Safe Reward | 460 | 12 |
| Car Push | Safety Gymnasium level-2 | Safe Reward | -0.48 | 12 |
| Point Push | Safety Gymnasium level-2 | Safe Reward | -1.1 | 12 |
| Swimmer Velocity | Safety Gymnasium level-2 | Safe Reward | 0.00e+0 | 12 |
| Point Button | Safety Gymnasium level-2 | Safe Reward | -1.6 | 12 |
| Point Goal | Safety Gymnasium level-2 | Safe Reward | -1.3 | 12 |
| Car Goal | Safety Gymnasium level-2 | Safe Reward | -0.75 | 12 |
| Constrained Reinforcement Learning | AntCircle | Episodic Reward | 168.3 | 8 |
| Constrained Reinforcement Learning | Humanoid | Episodic Reward | 1.60e+3 | 8 |