Learning Continuous Control Policies by Stochastic Value Gradients

About

We present a unified framework for learning continuous control policies using backpropagation. It supports stochastic control by treating stochasticity in the Bellman equation as a deterministic function of exogenous noise. The product is a spectrum of general policy gradient algorithms that range from model-free methods with value functions to model-based methods without value functions. We use learned models but only require observations from the environment instead of observations from model-predicted trajectories, minimizing the impact of compounded model errors. We apply these algorithms first to a toy stochastic control problem and then to several physics-based control problems in simulation. One of these variants, SVG(1), shows the effectiveness of learning models, value functions, and policies simultaneously in continuous domains.

Nicolas Heess, Greg Wayne, David Silver, Timothy Lillicrap, Yuval Tassa, Tom Erez • 2015
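
The key idea in the abstract, treating stochasticity as a deterministic function of exogenous noise, is what lets gradients flow through a sampled action. Below is a minimal PyTorch sketch of an SVG(0)-style policy update under that reparameterization; the network sizes, names (`ReparamPolicy`, `svg0_policy_update`), and the untrained critic are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions for illustration; not from the paper.
STATE_DIM, ACTION_DIM = 3, 1

class ReparamPolicy(nn.Module):
    """Stochastic policy a = mu(s) + sigma * eps, eps ~ N(0, I).

    Writing the noise as an exogenous input makes the action a
    deterministic, differentiable function of (state, eps), so
    gradients can flow through the sampled action.
    """
    def __init__(self):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(),
                                nn.Linear(64, ACTION_DIM))
        self.log_sigma = nn.Parameter(torch.zeros(ACTION_DIM))

    def forward(self, state, eps):
        return self.mu(state) + torch.exp(self.log_sigma) * eps

# Action-value critic Q(s, a); in an SVG(0)-style update the policy
# gradient comes from backpropagating dQ/da through the action.
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.Tanh(),
                       nn.Linear(64, 1))
policy = ReparamPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def svg0_policy_update(states):
    eps = torch.randn(states.shape[0], ACTION_DIM)  # exogenous noise
    actions = policy(states, eps)                   # differentiable in theta
    q = critic(torch.cat([states, actions], dim=-1))
    loss = -q.mean()                                # ascend E[Q(s, pi(s, eps))]
    opt.zero_grad()
    loss.backward()   # gradient flows Q -> action -> policy parameters
    opt.step()

# Usage with a random batch of states; the critic here is untrained,
# whereas a real agent would fit it from environment transitions.
svg0_policy_update(torch.randn(32, STATE_DIM))
```

The other end of the spectrum the abstract describes, SVG(∞), would instead backpropagate through a learned dynamics model along a trajectory, with no value function; SVG(1) combines a one-step model with a learned value function.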

Related benchmarks

Task               | Dataset          | Metric                     | Result  | Rank
Continuous Control | BipedalWalker v3 | Episodic Cumulative Reward | 74.8    | 15
Continuous Control | HalfCheetah v4   | Max Average Return         | 1.15e+3 | 12
Robotic Control    | Pendulum v1      | Local Optima Escape Rate   | 53.8    | 7
Robotic Control    | BipedalWalker v3 | Local Optima Escape Rate   | 46.9    | 7
Robotic Control    | HalfCheetah v4   | Local Optima Escape Rate   | 39.7    | 7
Robotic Control    | Humanoid v4      | Local Optima Escape Rate   | 32.5    | 7
Continuous Control | Pendulum v1      | Average Cumulative Reward  | -214.7  | 7
Continuous Control | Humanoid v4      | Average Cumulative Reward  | 381.4   | 7