RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning
About
Deep reinforcement learning (deep RL) has been successful in learning sophisticated behaviors automatically; however, the learning process requires a huge number of trials. In contrast, animals can learn new tasks in just a few trials, benefiting from their prior knowledge about the world. This paper seeks to bridge this gap. Rather than designing a "fast" reinforcement learning algorithm by hand, we propose to represent it as a recurrent neural network (RNN) and learn it from data. In our proposed method, RL$^2$, the algorithm is encoded in the weights of the RNN, which are learned slowly through a general-purpose ("slow") RL algorithm. The RNN receives all information a typical RL algorithm would receive — observations, actions, rewards, and termination flags — and it retains its state across episodes in a given Markov Decision Process (MDP). The activations of the RNN store the state of the "fast" RL algorithm on the current (previously unseen) MDP. We evaluate RL$^2$ experimentally on both small-scale and large-scale problems. On the small-scale side, we train it to solve randomly generated multi-armed bandit problems and finite MDPs. After RL$^2$ is trained, its performance on new MDPs is close to that of human-designed algorithms with optimality guarantees. On the large-scale side, we test RL$^2$ on a vision-based navigation task and show that it scales up to high-dimensional problems.
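The interface described above — a recurrent network that consumes (observation, previous action, previous reward, termination flag) at each step and keeps its hidden state across episode boundaries within one MDP — can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the class name `RL2PolicySketch`, the use of a hand-rolled GRU cell, and all layer sizes and initializations are assumptions.

```python
import numpy as np

class RL2PolicySketch:
    """Illustrative sketch of an RL^2-style recurrent policy.

    At every step the network receives the tuple
    (observation, previous action, previous reward, termination flag).
    The hidden state persists across episodes within a trial on a single
    MDP and is reset only when a new MDP is sampled, so the activations
    can accumulate information about the current task.
    """

    def __init__(self, obs_dim, n_actions, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        # Input = obs + one-hot prev action + prev reward + done flag.
        in_dim = obs_dim + n_actions + 2
        # GRU parameters: update gate z, reset gate r, candidate state.
        self.Wz = rng.normal(0, 0.1, (hidden, in_dim + hidden))
        self.Wr = rng.normal(0, 0.1, (hidden, in_dim + hidden))
        self.Wh = rng.normal(0, 0.1, (hidden, in_dim + hidden))
        self.Wout = rng.normal(0, 0.1, (n_actions, hidden))
        self.n_actions = n_actions
        self.h = np.zeros(hidden)

    def reset(self):
        """Called only when a NEW MDP starts, not between episodes."""
        self.h = np.zeros_like(self.h)

    def step(self, obs, prev_action, prev_reward, done):
        """One recurrent step; returns a probability distribution over actions."""
        a_onehot = np.zeros(self.n_actions)
        if prev_action is not None:
            a_onehot[prev_action] = 1.0
        x = np.concatenate([obs, a_onehot, [prev_reward, float(done)]])
        xh = np.concatenate([x, self.h])
        sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
        z = sigmoid(self.Wz @ xh)
        r = sigmoid(self.Wr @ xh)
        h_cand = np.tanh(self.Wh @ np.concatenate([x, r * self.h]))
        self.h = (1.0 - z) * self.h + z * h_cand
        logits = self.Wout @ self.h
        probs = np.exp(logits - logits.max())
        return probs / probs.sum()
```

In the paper's setup, a "slow" outer-loop RL algorithm would train these weights across many sampled MDPs, so that unrolling the fixed-weight RNN on a fresh MDP behaves like a "fast" learning algorithm; the sketch above only shows the data flow, with `reset()` marking the MDP boundary where the hidden state is cleared.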
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Reach | Meta-World ML-1 (test) | Success Rate | 100 | 9 |
| Continuous Control | reacher | Average Reward | -150.6 | 9 |
| Task Generalization | Meta-World ML-10 (test) | Success Rate | 35.8 | 8 |
| Task Generalization | Meta-World ML-45 (test) | Success Rate | 33.3 | 8 |
| Reinforcement Learning | Half-cheetah-velocity (train) | Runtime (hours) | 25 | 7 |
| Navigation | GridWorld | Avg Episode Return | 33.4 | 6 |
| Locomotion | HalfCheetah Dir | Avg Episode Return | -420 | 6 |
| Locomotion | HalfCheetah Vel | Avg Episode Return | -513.2 | 6 |
| Locomotion | Wind+Vel | Avg Episode Return | -493.5 | 6 |
| Robot Assistance | ScratchItch | Average Episode Return | 50.4 | 6 |