Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning

About

Model-free deep reinforcement learning algorithms have been shown to be capable of learning a wide range of robotic skills, but typically require a very large number of samples to achieve good performance. Model-based algorithms, in principle, can provide for much more efficient learning, but have proven difficult to extend to expressive, high-capacity models such as deep neural networks. In this work, we demonstrate that medium-sized neural network models can in fact be combined with model predictive control (MPC) to achieve excellent sample complexity in a model-based reinforcement learning algorithm, producing stable and plausible gaits to accomplish various complex locomotion tasks. We also propose using deep neural network dynamics models to initialize a model-free learner, in order to combine the sample efficiency of model-based approaches with the high task-specific performance of model-free methods. We empirically demonstrate on MuJoCo locomotion tasks that our pure model-based approach trained on just random action data can follow arbitrary trajectories with excellent sample efficiency, and that our hybrid algorithm can accelerate model-free learning on high-speed benchmark tasks, achieving sample efficiency gains of 3-5x on swimmer, cheetah, hopper, and ant agents. Videos can be found at https://sites.google.com/view/mbmf

Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, Sergey Levine • 2017
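
As a rough illustration of the model-based component described in the abstract, the sketch below shows a generic random-shooting MPC loop over a learned dynamics model: sample many candidate action sequences, roll them forward through the model, score them with a known reward function, and execute only the first action of the best sequence before re-planning. The function names (`dynamics_model`, `reward_fn`), the candidate count, and the planning horizon are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def random_shooting_mpc(state, dynamics_model, reward_fn, action_dim,
                        horizon=10, n_candidates=1000,
                        action_low=-1.0, action_high=1.0):
    """Plan one action with a learned dynamics model via random-shooting MPC.

    Assumed (hypothetical) interfaces, not the paper's code:
      dynamics_model(states, actions) -> predicted next states, batched
      reward_fn(states, actions, next_states) -> per-candidate rewards
    """
    # Candidate action sequences: (n_candidates, horizon, action_dim)
    actions = np.random.uniform(action_low, action_high,
                                size=(n_candidates, horizon, action_dim))
    states = np.repeat(state[None, :], n_candidates, axis=0)
    returns = np.zeros(n_candidates)

    # Roll every candidate sequence forward through the learned model.
    for t in range(horizon):
        next_states = dynamics_model(states, actions[:, t])
        returns += reward_fn(states, actions[:, t], next_states)
        states = next_states

    # Execute only the first action of the best sequence, then re-plan.
    best = np.argmax(returns)
    return actions[best, 0]
```

At each control step the agent would call this planner with the current state, apply the returned action, log the observed transition, and periodically retrain the dynamics model on the growing dataset; re-planning at every step is what lets a medium-sized, imperfect model still produce stable behavior, and the resulting model-based controller can then be used to initialize (e.g., via imitation) a model-free learner for fine-tuning.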

Related benchmarks

Task | Dataset | Metric | Result | Rank
Continuous Control | BipedalWalker v3 | Episodic Cumulative Reward | 219.6 | 15
Continuous Control | HalfCheetah v4 | Max Average Return | 4.13e+3 | 12
Continuous Control | Pendulum v1 | Average Cumulative Reward | -188.3 | 7
Continuous Control | Humanoid v4 | Average Cumulative Reward | 2.78e+3 | 7
Robotic Control | Pendulum v1 | Local Optima Escape Rate | 42.5 | 7
Robotic Control | BipedalWalker v3 | Local Optima Escape Rate | 38.3 | 7
Robotic Control | HalfCheetah v4 | Local Optima Escape Rate | 31.4 | 7
Robotic Control | Humanoid v4 | Local Optima Escape Rate | 24.9 | 7
Power System Control | IEEE 39-bus New England test system critical disturbances simulation | Constraint Violations (%) | 6.2 | 6
UAV Obstacle Avoidance | UAV Obstacle Avoidance environment, 100 trials (test) | Success Rate | 68.2 | 6