
Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling

About

Recent advances in deep reinforcement learning have made significant strides in performance on applications such as Go and Atari games. However, developing practical methods to balance exploration and exploitation in complex domains remains largely unsolved. Thompson Sampling and its extension to reinforcement learning provide an elegant approach to exploration that only requires access to posterior samples of the model. At the same time, advances in approximate Bayesian methods have made posterior approximation for flexible neural network models practical. Thus, it is attractive to consider approximate Bayesian neural networks in a Thompson Sampling framework. To understand the impact of using an approximate posterior on Thompson Sampling, we benchmark well-established and recently developed methods for approximate posterior sampling combined with Thompson Sampling over a series of contextual bandit problems. We found that many approaches that have been successful in the supervised learning setting underperformed in the sequential decision-making scenario. In particular, we highlight the challenge of adapting slowly converging uncertainty estimates to the online setting.
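To make the setup concrete, here is a minimal sketch of Thompson Sampling on a contextual bandit. The paper benchmarks many posterior approximations for neural networks (dropout, variational inference, neural-linear, and others); this toy example instead uses exact Bayesian linear regression per arm, which is in the spirit of the simpler linear baselines but is not the paper's code. The class name and hyperparameters are illustrative assumptions.

```python
import numpy as np

class LinearTSBandit:
    """Illustrative Thompson Sampling with a Bayesian linear model per arm.

    Assumes known reward-noise variance and a Gaussian prior N(0, prior_var*I),
    so each arm's posterior over its weight vector stays Gaussian in closed form.
    """

    def __init__(self, n_arms, dim, noise_var=0.25, prior_var=1.0):
        self.n_arms, self.dim = n_arms, dim
        self.noise_var = noise_var
        # Posterior precision starts at the prior precision I / prior_var;
        # xty accumulates X^T y / noise_var for the posterior mean.
        self.precision = [np.eye(dim) / prior_var for _ in range(n_arms)]
        self.xty = [np.zeros(dim) for _ in range(n_arms)]

    def select(self, context, rng):
        # Thompson step: sample weights from each arm's posterior,
        # then act greedily with respect to the sampled model.
        sampled = []
        for a in range(self.n_arms):
            mean = np.linalg.solve(self.precision[a], self.xty[a])
            cov = np.linalg.inv(self.precision[a])
            theta = rng.multivariate_normal(mean, cov)
            sampled.append(context @ theta)
        return int(np.argmax(sampled))

    def update(self, arm, context, reward):
        # Conjugate Gaussian update of the chosen arm's posterior.
        self.precision[arm] += np.outer(context, context) / self.noise_var
        self.xty[arm] += reward * context / self.noise_var
```

Because the sampled model is sometimes optimistic for under-explored arms, exploration falls out automatically: an arm is pulled exactly as often as it is plausibly optimal under the current posterior. The paper's observation is that when the posterior is only approximate (and its uncertainty estimates converge slowly), this mechanism can over- or under-explore.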

Carlos Riquelme, George Tucker, Jasper Snoek • 2018

Related benchmarks

| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Regression | elevators (test) | RMSE | 0.101 | 19 |
| Regression | Protein (test) | Test Log Likelihood | -0.884 | 18 |
| Regression | Skillcraft (test) | Test Log Likelihood | -1.002 | 17 |
| Regression | Protein (test) | RMSE | 0.447 | 10 |
| Contextual Bandit | Wheel Bandit (delta=0.99) | Normalized Cumulative Regret | 86.03 | 9 |
| Contextual Bandit | Wheel Bandit (delta=0.50) | Normalized Cumulative Regret | 18.71 | 9 |
| Contextual Bandit | Wheel Bandit (delta=0.70) | Normalized Cumulative Regret | 26.63 | 9 |
| Contextual Bandit | Wheel Bandit (delta=0.90) | Normalized Cumulative Regret | 45.47 | 9 |
| Contextual Bandit | Wheel Bandit (delta=0.95) | Normalized Cumulative Regret | 65.44 | 9 |
| Contextual Bandit | Mushroom | Relative Cumulative Regret | 2.66 | 9 |

Showing 10 of 20 rows.

Other info

Code
