# Projection-Based Constrained Policy Optimization

## About
We consider the problem of learning control policies that optimize a reward function while satisfying constraints due to considerations of safety, fairness, or other costs. We propose a new algorithm, Projection-Based Constrained Policy Optimization (PCPO). This is an iterative method for optimizing policies in a two-step process: the first step performs a local reward improvement update, while the second step reconciles any constraint violation by projecting the policy back onto the constraint set. We theoretically analyze PCPO and provide a lower bound on reward improvement, and an upper bound on constraint violation, for each policy update. We further characterize the convergence of PCPO based on two different metrics: $\ell_2$ norm and Kullback-Leibler divergence. Our empirical results over several control tasks demonstrate that PCPO achieves superior performance, averaging more than 3.5 times less constraint violation and around 15% higher reward compared to state-of-the-art methods.
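The two-step update above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it uses a plain gradient-ascent reward step (the paper uses a trust-region step) and projects onto a linearized cost constraint under the $\ell_2$ metric; the function name and signature are hypothetical.

```python
import numpy as np

def pcpo_step_l2(theta, reward_grad, a, b, lr=0.1):
    """One illustrative PCPO-style iteration with an L2-norm projection.

    theta       : current policy parameters
    reward_grad : gradient of the reward objective at theta
    a, b        : linearized cost constraint  a @ (theta_new - theta) + b <= 0
    """
    # Step 1: local reward-improvement update (gradient ascent here;
    # PCPO itself uses a trust-region / natural-gradient step).
    theta_half = theta + lr * reward_grad

    # Step 2: if the intermediate policy violates the (linearized)
    # constraint, project it back onto the constraint set -- the closest
    # point in L2 distance that satisfies a @ (theta_new - theta) + b <= 0.
    violation = a @ (theta_half - theta) + b
    if violation > 0:
        theta_half = theta_half - (violation / (a @ a)) * a
    return theta_half
```

When the reward step already satisfies the constraint, the projection is a no-op; otherwise the update moves only as far along the constraint normal as needed, which is what bounds the per-update constraint violation.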
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Car Circle | Safety Gymnasium level-2 | Safe Reward | 11 | 12 |
| Hopper Velocity | Safety Gymnasium level-2 | Safe Reward | 460 | 12 |
| Car Push | Safety Gymnasium level-2 | Safe Reward | -0.48 | 12 |
| Point Push | Safety Gymnasium level-2 | Safe Reward | -1.1 | 12 |
| Swimmer Velocity | Safety Gymnasium level-2 | Safe Reward | 0.00e+0 | 12 |
| Point Button | Safety Gymnasium level-2 | Safe Reward | -1.6 | 12 |
| Point Goal | Safety Gymnasium level-2 | Safe Reward | -1.3 | 12 |
| Car Goal | Safety Gymnasium level-2 | Safe Reward | -0.75 | 12 |
| Constrained Reinforcement Learning | AntCircle | Episodic Reward | 168.3 | 8 |
| Constrained Reinforcement Learning | Humanoid | Episodic Reward | 1.60e+3 | 8 |