First Order Constrained Optimization in Policy Space
About
In reinforcement learning, an agent attempts to learn high-performing behavior through interaction with its environment; such behavior is typically quantified by a reward function. However, some aspects of behavior, such as those deemed unsafe and to be avoided, are best captured through constraints. We propose First Order Constrained Optimization in Policy Space (FOCOPS), a novel approach that maximizes an agent's overall reward while ensuring the agent satisfies a set of cost constraints. Using data generated by the current policy, FOCOPS first finds the optimal update policy by solving a constrained optimization problem in the nonparameterized policy space; it then projects this update policy back into the parametric policy space. Our approach has an approximate upper bound on worst-case constraint violation throughout training and, being first-order, is simple to implement. We provide empirical evidence that this simple approach achieves better performance on a set of constrained robotic locomotion tasks.
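The first step above can be sketched for a discrete action set: the optimal nonparameterized update policy reweights the current policy by exponentiated reward advantages, penalized by cost advantages. This is a minimal illustration, not the paper's full algorithm; the function name, the example advantage values, and the temperature `lam` and cost multiplier `nu` are hypothetical placeholders (in FOCOPS these arise from the Lagrangian dual of the constrained problem).

```python
import numpy as np

def nonparametric_update(pi_old, adv, cost_adv, lam=1.0, nu=0.5):
    """Sketch of the optimal nonparameterized policy at one state.

    pi_old:   current policy probabilities over a discrete action set
    adv:      reward-advantage estimates A(s, a)
    cost_adv: cost-advantage estimates A_C(s, a)
    lam, nu:  illustrative temperature and cost-penalty multipliers
    """
    # Reweight the old policy by exponentiated (reward - nu * cost) advantages,
    # then renormalize so the result is a valid probability distribution.
    weights = pi_old * np.exp((adv - nu * cost_adv) / lam)
    return weights / weights.sum()

pi_old = np.array([0.25, 0.25, 0.25, 0.25])
adv = np.array([1.0, 0.0, -1.0, 0.5])      # hypothetical advantages
cost_adv = np.array([0.0, 1.0, 0.0, 2.0])  # hypothetical cost advantages
pi_star = nonparametric_update(pi_old, adv, cost_adv)
```

The second step, projecting `pi_star` back into the parametric policy space, would then minimize the KL divergence between the parameterized policy and this target distribution via ordinary gradient descent, which is why the overall method stays first-order.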
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Car Goal | Safety Gymnasium level-2 | Safe Reward | 1.6 | 12 |
| Point Button | Safety Gymnasium level-2 | Safe Reward | 1.3 | 12 |
| Point Goal | Safety Gymnasium level-2 | Safe Reward | 2.2 | 12 |
| Hopper Velocity | Safety Gymnasium level-2 | Safe Reward | 970 | 12 |
| Car Circle | Safety Gymnasium level-2 | Safe Reward | 8.1 | 12 |
| Car Push | Safety Gymnasium level-2 | Safe Reward | 0.05 | 12 |
| Point Push | Safety Gymnasium level-2 | Safe Reward | -0.2 | 12 |
| Swimmer Velocity | Safety Gymnasium level-2 | Safe Reward | 15 | 12 |
| Humanoid | Constrained Reinforcement Learning | Episodic Reward | 1.73e+3 | 8 |
| PointCircle | Constrained Reinforcement Learning | Episodic Reward | 81.6 | 8 |