VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning

About

Trading off exploration and exploitation in an unknown environment is key to maximising expected return during learning. A Bayes-optimal policy, which does so optimally, conditions its actions not only on the environment state but on the agent's uncertainty about the environment. Computing a Bayes-optimal policy is however intractable for all but the smallest tasks. In this paper, we introduce variational Bayes-Adaptive Deep RL (variBAD), a way to meta-learn to perform approximate inference in an unknown environment, and incorporate task uncertainty directly during action selection. In a grid-world domain, we illustrate how variBAD performs structured online exploration as a function of task uncertainty. We further evaluate variBAD on MuJoCo domains widely used in meta-RL and show that it achieves higher online return than existing methods.
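To make the idea of conditioning actions on task uncertainty concrete, here is a minimal sketch of an exact Bayesian belief update over an unknown goal cell in a tiny grid-world. It is not variBAD itself: variBAD meta-learns approximate inference with a neural network, whereas this toy uses an exact, hand-written filter. The names `update_belief`, `entropy`, and `policy_input`, as well as the five-cell layout, are assumptions chosen for illustration only.

```python
import numpy as np

def update_belief(belief, visited_cell, found_goal):
    """Exact Bayes update of a belief over goal cells after visiting one cell.

    Hypothetical helper: assumes visiting a cell reveals with certainty
    whether the goal is located there.
    """
    posterior = belief.copy()
    if found_goal:
        # All probability mass collapses onto the visited cell.
        posterior[:] = 0.0
        posterior[visited_cell] = 1.0
        return posterior
    # The goal is not here: rule this cell out and renormalise the rest.
    posterior[visited_cell] = 0.0
    total = posterior.sum()
    if total == 0.0:
        raise ValueError("observation is inconsistent with the current belief")
    return posterior / total

def entropy(belief):
    """Shannon entropy (in nats) of the belief, a simple uncertainty measure."""
    p = belief[belief > 0]
    return float(-(p * np.log(p)).sum())

# Uniform prior over five candidate goal cells (illustrative layout).
n_cells = 5
belief = np.full(n_cells, 1.0 / n_cells)

# The agent visits cell 2 and does not find the goal there.
belief = update_belief(belief, visited_cell=2, found_goal=False)

# A belief-conditioned policy would receive the environment state together
# with the belief; here the state is a one-hot encoding of the agent's cell.
state = np.eye(n_cells)[2]
policy_input = np.concatenate([state, belief])
```

After the update, the belief spreads evenly over the four unvisited cells, so an agent acting on `policy_input` has the information it needs to avoid revisiting cell 2. This is the kind of structured exploration the abstract describes, here obtained by exact inference rather than by a learned model.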

Luisa Zintgraf, Kyriacos Shiarlis, Maximilian Igl, Sebastian Schulze, Yarin Gal, Katja Hofmann, Shimon Whiteson • 2019

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Push | Meta-World ML-1 (test) | Success Rate | 0.88 | 12 |
| Reach | MetaWorld ML1 Reach-OOD-Extra (extrapolation) | Success Rate | 82 | 9 |
| Continuous Control | MuJoCo HalfCheetah Vel (test) | Mean Return | -82 | 9 |
| Push | MetaWorld ML1 Push OOD (interpolation) | Average Success Rate | 83 | 9 |
| Push | MetaWorld ML1 Push-OOD-Extra (extrapolation) | Average Success Rate | 65 | 9 |
| Reach | MetaWorld ML1 Reach | Average Success Rate | 73 | 9 |
| Reach | MetaWorld ML1 Reach-OOD (interpolation) | Average Success Rate | 82 | 9 |
| Continuous Control | reacher | Average Reward | -102.4 | 9 |
| Continuous Control | MuJoCo HalfCheetah 10D-task (a) | Mean Return | 1.89e+3 | 7 |
| Continuous Control | MuJoCo HalfCheetah 10D-task (b) | Mean Return | 1.98e+3 | 7 |
Showing 10 of 44 rows
