Policy Optimization for Continuous Reinforcement Learning
About
We study reinforcement learning (RL) in continuous time and space, over an infinite horizon with a discounted objective, where the underlying dynamics are driven by a stochastic differential equation. Building on recent advances in the continuous approach to RL, we develop a notion of occupation time (specifically for a discounted objective) and show how it can be used to derive performance-difference and local-approximation formulas. We then apply these results to policy gradient (PG) and trust region policy optimization/proximal policy optimization (TRPO/PPO) methods, which are familiar and powerful tools in discrete RL but remain under-developed in continuous RL. Numerical experiments demonstrate the effectiveness and advantages of our approach.
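To make the setting concrete, here is a minimal sketch of estimating a discounted objective when the state follows a controlled stochastic differential equation. It is not the paper's method: it uses a plain Euler-Maruyama discretization with Monte Carlo averaging, and every name in it (`discounted_return`, `policy`, `reward`, `drift`, `diffusion`, the discount rate `beta`, and the toy linear-quadratic problem) is an assumption chosen for illustration.

```python
import numpy as np

def discounted_return(policy, reward, drift, diffusion, x0,
                      beta=1.0, dt=0.01, horizon=10.0, n_paths=1000, seed=0):
    """Monte Carlo estimate of E[ integral_0^T exp(-beta*t) r(X_t, a_t) dt ].

    The state follows dX = drift(X, a) dt + diffusion(X, a) dW, simulated
    with Euler-Maruyama. The infinite horizon is truncated at `horizon`,
    which is reasonable once exp(-beta * horizon) is negligible.
    """
    rng = np.random.default_rng(seed)
    n_steps = int(round(horizon / dt))
    x = np.full(n_paths, x0, dtype=float)   # one state per simulated path
    total = np.zeros(n_paths)               # running discounted reward per path
    for k in range(n_steps):
        a = policy(x)
        # Left-endpoint Riemann sum of the discounted reward integral.
        total += np.exp(-beta * k * dt) * reward(x, a) * dt
        # Brownian increment over one step: Normal(0, dt).
        dw = rng.normal(0.0, np.sqrt(dt), size=n_paths)
        x = x + drift(x, a) * dt + diffusion(x, a) * dw
    return total.mean()

# Hypothetical toy problem: the action steers the state, cost is quadratic.
value = discounted_return(
    policy=lambda x: -0.5 * x,              # linear feedback policy (assumed)
    reward=lambda x, a: -(x**2 + a**2),     # quadratic cost as negative reward
    drift=lambda x, a: a,
    diffusion=lambda x, a: 0.3,             # constant volatility (assumed)
    x0=1.0,
)
print(value)
```

With a constant reward of 1 the estimate should approach 1/beta as the step size shrinks and the horizon grows, which gives a quick sanity check on the discounting.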
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Reinforcement Learning | Walker | Average Returns | 51.77 | 38 |
| Quadruped | Quadruped | Return | 160.2 | 33 |
| Reinforcement Learning | Humanoid | Zero-Shot Reward | 1.16 | 30 |
| Reinforcement Learning | Trading | Return | 23.46 | 24 |
| Reinforcement Learning | cheetah | Return | 174.5 | 24 |