Learning Continuous Control Policies by Stochastic Value Gradients
About
We present a unified framework for learning continuous control policies using backpropagation. It supports stochastic control by treating stochasticity in the Bellman equation as a deterministic function of exogenous noise. The product is a spectrum of general policy gradient algorithms that range from model-free methods with value functions to model-based methods without value functions. We use learned models but only require observations from the environment instead of observations from model-predicted trajectories, minimizing the impact of compounded model errors. We apply these algorithms first to a toy stochastic control problem and then to several physics-based control problems in simulation. One of these variants, SVG(1), shows the effectiveness of learning models, value functions, and policies simultaneously in continuous domains.
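The core idea above — treating stochasticity as a deterministic function of exogenous noise so that gradients can be backpropagated through the Bellman equation — is the reparameterization at the heart of a stochastic value gradient. Below is a minimal sketch of an SVG(1)-style policy update, assuming PyTorch; the names (`Policy`, `mlp`, `svg1_policy_loss`), network sizes, and hyperparameters are illustrative, not from the paper's code.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(),
                         nn.Linear(hidden, out_dim))

class Policy(nn.Module):
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.mu = mlp(s_dim, a_dim)
        self.log_sigma = nn.Parameter(torch.zeros(a_dim))

    def forward(self, s, eta):
        # Reparameterized action a = mu(s) + sigma * eta: a deterministic
        # function of the state and the exogenous noise sample eta, so the
        # gradient flows through the stochastic policy.
        return self.mu(s) + self.log_sigma.exp() * eta

def svg1_policy_loss(policy, model, reward, value, s, gamma=0.99):
    # One-step stochastic value gradient:
    #   V(s) ~ r(s, a) + gamma * V(f(s, a)),
    # differentiated w.r.t. the policy parameters through a learned,
    # differentiable dynamics model f and a learned value function V.
    eta = torch.randn(s.shape[0], policy.log_sigma.numel())
    a = policy(s, eta)
    s_next = model(torch.cat([s, a], dim=-1))
    v = reward(torch.cat([s, a], dim=-1)) + gamma * value(s_next)
    return -v.mean()  # minimizing -V ascends the value gradient

# Usage sketch: states come from real environment rollouts rather than
# model-predicted trajectories, limiting compounded model error. In the
# full algorithm, model, reward, and value are fitted from the same real
# transitions; here they are untrained placeholders.
s_dim, a_dim = 3, 1
policy = Policy(s_dim, a_dim)
model = mlp(s_dim + a_dim, s_dim)   # learned dynamics f(s, a) -> s'
reward = mlp(s_dim + a_dim, 1)      # learned (or known) reward r(s, a)
value = mlp(s_dim, 1)               # learned value function V(s)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

s = torch.randn(32, s_dim)          # a batch of observed states
opt.zero_grad()
loss = svg1_policy_loss(policy, model, reward, value, s)
loss.backward()
opt.step()
```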
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Continuous Control | BipedalWalker v3 | Episodic Cumulative Reward | 74.8 | 15 |
| Continuous Control | HalfCheetah v4 | Max Average Return | 1.15e+3 | 12 |
| Robotic Control | Pendulum v1 | Local Optima Escape Rate | 53.8 | 7 |
| Robotic Control | BipedalWalker v3 | Local Optima Escape Rate | 46.9 | 7 |
| Robotic Control | HalfCheetah v4 | Local Optima Escape Rate | 39.7 | 7 |
| Robotic Control | Humanoid v4 | Local Optima Escape Rate | 32.5 | 7 |
| Continuous Control | Pendulum v1 | Average Cumulative Reward | -214.7 | 7 |
| Continuous Control | Humanoid v4 | Average Cumulative Reward | 381.4 | 7 |