VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning
About
Trading off exploration and exploitation in an unknown environment is key to maximising expected return during learning. A Bayes-optimal policy, which does so optimally, conditions its actions not only on the environment state but on the agent's uncertainty about the environment. Computing a Bayes-optimal policy is however intractable for all but the smallest tasks. In this paper, we introduce variational Bayes-Adaptive Deep RL (variBAD), a way to meta-learn to perform approximate inference in an unknown environment, and incorporate task uncertainty directly during action selection. In a grid-world domain, we illustrate how variBAD performs structured online exploration as a function of task uncertainty. We further evaluate variBAD on MuJoCo domains widely used in meta-RL and show that it achieves higher online return than existing methods.
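The core idea in the abstract — meta-learning a posterior over an unknown task and feeding that belief to the policy alongside the state — can be sketched minimally. The snippet below is an illustrative toy, not the paper's implementation: all dimensions, weight shapes, and function names are assumptions, and the networks are untrained random linear maps standing in for the learned trajectory encoder and policy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative only, not from the paper).
STATE_DIM, ACTION_DIM, LATENT_DIM, HIDDEN_DIM = 4, 2, 3, 8

# Recurrent trajectory encoder: maps (state, action, reward) transitions to
# the parameters of a Gaussian posterior over a latent task variable m,
# i.e. the agent's belief about which task it is in.
W_in = rng.normal(0, 0.1, (HIDDEN_DIM, STATE_DIM + ACTION_DIM + 1))
W_h = rng.normal(0, 0.1, (HIDDEN_DIM, HIDDEN_DIM))
W_mu = rng.normal(0, 0.1, (LATENT_DIM, HIDDEN_DIM))
W_logvar = rng.normal(0, 0.1, (LATENT_DIM, HIDDEN_DIM))

def encode_belief(trajectory):
    """Run an RNN over (s, a, r) tuples; return mean and log-variance of q(m | tau)."""
    h = np.zeros(HIDDEN_DIM)
    for s, a, r in trajectory:
        x = np.concatenate([s, a, [r]])
        h = np.tanh(W_in @ x + W_h @ h)
    return W_mu @ h, W_logvar @ h

# The policy conditions on the state AND on the belief parameters, so its
# actions can depend on task uncertainty -- the Bayes-adaptive idea.
W_pi = rng.normal(0, 0.1, (ACTION_DIM, STATE_DIM + 2 * LATENT_DIM))

def policy(state, belief_mu, belief_logvar):
    return np.tanh(W_pi @ np.concatenate([state, belief_mu, belief_logvar]))

# One action selection step on a random rollout prefix.
traj = [(rng.normal(size=STATE_DIM), rng.normal(size=ACTION_DIM), 0.5)
        for _ in range(5)]
mu, logvar = encode_belief(traj)
action = policy(rng.normal(size=STATE_DIM), mu, logvar)
print(action.shape)  # (2,)
```

As more transitions accumulate in the trajectory, the posterior over the task narrows, and the policy's behaviour can shift from exploration under uncertainty to exploitation — the structured online exploration the abstract describes.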
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Continuous Control | MuJoCo HalfCheetah Vel (test) | Mean Return | -82 | 9 |
| Continuous Control | reacher | Average Reward | -102.4 | 9 |
| Continuous Control | MuJoCo HalfCheetah 10D-task (a) | Mean Return | 1.89e+3 | 7 |
| Continuous Control | MuJoCo HalfCheetah 10D-task (b) | Mean Return | 1.98e+3 | 7 |
| Continuous Control | MuJoCo HalfCheetah 10D-task (c) | Mean Return | 1.62e+3 | 7 |
| Continuous Control | MuJoCo HalfCheetah Body (test) | Mean Return | 1.62e+3 | 7 |
| Meta-Reinforcement Learning | MuJoCo HalfCheetah Velocity variation (test) | CVaR 0.05 Return | -202 | 7 |
| Meta-Reinforcement Learning | MuJoCo HalfCheetah Body variation (test) | CVaR 0.05 Return | 835 | 7 |
| Continuous Control | MuJoCo HalfCheetah Mass (test) | Mean Return | 1.56e+3 | 7 |
| Reinforcement Learning | Half-cheetah-velocity (train) | Runtime (hours) | 10 | 7 |