
Beyond UCB: Optimal and Efficient Contextual Bandits with Regression Oracles

About

A fundamental challenge in contextual bandits is to develop flexible, general-purpose algorithms with computational requirements no worse than classical supervised learning tasks such as classification and regression. Algorithms based on regression have shown promising empirical success, but theoretical guarantees have remained elusive except in special cases. We provide the first universal and optimal reduction from contextual bandits to online regression. We show how to transform any oracle for online regression with a given value function class into an algorithm for contextual bandits with the induced policy class, with no overhead in runtime or memory requirements. We characterize the minimax rates for contextual bandits with general, potentially nonparametric function classes, and show that our algorithm is minimax optimal whenever the oracle obtains the optimal rate for regression. Compared to previous results, our algorithm requires no distributional assumptions beyond realizability, and works even when contexts are chosen adversarially.
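To make the reduction concrete, below is a minimal sketch of inverse-gap weighting, the mechanism this paper uses to turn a regression oracle's predicted rewards into an action-sampling distribution: the predicted-best action keeps most of the mass, and every other action is played with probability inversely proportional to its predicted reward gap. The function name, the `gamma` value, and the example predictions are illustrative, not taken from the paper.

```python
import numpy as np

def inverse_gap_weighting(preds, gamma):
    """Map predicted rewards for K actions to a sampling distribution.

    preds: predicted reward per action, as supplied by a regression oracle.
    gamma: exploration parameter; larger gamma concentrates mass on the
           predicted-best action (in the paper it grows with the oracle's
           regression-regret rate).
    """
    preds = np.asarray(preds, dtype=float)
    K = len(preds)
    best = int(np.argmax(preds))
    p = np.zeros(K)
    for a in range(K):
        if a != best:
            # Gap preds[best] - preds[a] >= 0, so each denominator is >= K.
            p[a] = 1.0 / (K + gamma * (preds[best] - preds[a]))
    # Remaining mass goes to the predicted-best action.
    p[best] = 1.0 - p.sum()
    return p

# Actions with larger predicted gaps receive less exploration mass.
probs = inverse_gap_weighting([0.2, 0.5, 0.1], gamma=10.0)
```

At each round the learner would feed the current context to the oracle, sample an action from this distribution, and pass the observed reward back to the oracle as a regression example, so the total cost per round is one oracle call.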

Dylan J. Foster, Alexander Rakhlin · 2020

Related benchmarks

Task                         Dataset   Metric                                Result    Rank
Policy Optimization in CMAB  1012_2    PV-loss (Mean Diff)                   0.0381    5
Policy Optimization in CMAB  476_2     Mean Diff from Supervised (PV-loss)   -0.012    5
Policy Optimization in CMAB  457_4     Mean PV-loss Difference               0.0963    5
Policy Optimization in CMAB  729_2     Mean PV-Loss Difference               0.0182    5
Policy Optimization in CMAB  785_2     Mean Diff from Supervised (PV-loss)   0.0111    5
Policy Optimization in CMAB  874_2     PV-loss (Mean Difference)             0.094     5
Policy Optimization in CMAB  1006_2    Mean Diff (PV-loss)                   0.1277    5
Policy Optimization in CMAB  1073_2    Mean PV-loss Difference               -0.0427   5
Policy Optimization in CMAB  848_2     Mean Diff from Supervised (PV-loss)   0.0895    5
Policy Optimization in CMAB  1015_2    PV-Loss Difference                    0.0042    5
Showing 10 of 24 rows
