
Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

About

We study the off-policy evaluation problem (estimating the value of a target policy using data collected by another policy) under the contextual bandit model. We consider the general (agnostic) setting without access to a consistent model of rewards and establish a minimax lower bound on the mean squared error (MSE). The bound is matched up to constants by the inverse propensity scoring (IPS) and doubly robust (DR) estimators. This highlights the difficulty of the agnostic contextual setting, in contrast with multi-armed bandits and contextual bandits with access to a consistent reward model, where IPS is suboptimal. We then propose the SWITCH estimator, which can use an existing reward model (not necessarily consistent) to achieve a better bias-variance tradeoff than IPS and DR. We prove an upper bound on its MSE and demonstrate its benefits empirically on a diverse collection of data sets, often outperforming prior work by orders of magnitude.

Yu-Xiang Wang, Alekh Agarwal, Miroslav Dudík • 2016
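
The abstract references three estimators: IPS, DR, and the proposed SWITCH. A minimal NumPy sketch of all three is below; the function names, argument layout, and the fixed threshold tau are illustrative assumptions, not the authors' reference implementation (the paper selects the threshold data-dependently rather than taking it as a fixed input).

```python
import numpy as np

def ips(rewards, w):
    """Inverse propensity scoring: mean of importance-weighted rewards.
    w[i] = pi(a_i|x_i) / mu(a_i|x_i) for logged action a_i at context x_i."""
    return np.mean(w * rewards)

def dr(rewards, w, q_logged, v_target):
    """Doubly robust: model-based value plus an importance-weighted residual.
    q_logged[i] = model's predicted reward for the logged action of sample i.
    v_target[i] = model's estimate of the target policy's value at context i."""
    return np.mean(v_target + w * (rewards - q_logged))

def switch(rewards, w_logged, w_all, pi_all, q_all, tau):
    """SWITCH (IPS variant): use IPS where the importance weight is small
    (<= tau) and the reward model where it is large, trading variance for
    bounded bias. Hypothetical argument layout:
      w_logged[i]  = importance weight of the logged action at context i
      w_all[i, a]  = importance weight of every action a at context i
      pi_all[i, a] = target-policy probability of action a at context i
      q_all[i, a]  = model-predicted reward of action a at context i"""
    ips_part = w_logged * rewards * (w_logged <= tau)            # low-weight actions: IPS
    model_part = np.sum(pi_all * q_all * (w_all > tau), axis=1)  # high-weight actions: model
    return np.mean(ips_part + model_part)
```

Setting tau above the largest importance weight recovers IPS, and tau = 0 gives the purely model-based estimate; choosing the threshold in between is what yields the bias-variance tradeoff described in the abstract.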

Related benchmarks

Task | Dataset | Metric | Result | Rank
Off-policy Evaluation | Digits (UCI) | MSE | 0.0557 | 12
Average Treatment Effect Estimation | Twins (n=200) | MAE (eATE) | 0.065 | 6
Off-policy Evaluation | PenDigits (UCI) | MSE | 0.0342 | 6
Off-policy Evaluation | SatImage (UCI) | MSE | 0.0136 | 6
Average Treatment Effect Estimation | Twins (n=1600) | MAE (eATE) | 0.071 | 6
Average Treatment Effect Estimation | Twins (n=3200) | MAE (eATE) | 0.069 | 6
Off-policy Evaluation | Letter (UCI) | MSE | 0.2387 | 6
Off-policy Evaluation | CIFAR-100 | MSE | 1.1644 | 6
Off-policy Evaluation | MNIST | MSE | 0.275 | 6
Average Treatment Effect Estimation | Twins (n=50) | MAE (eATE) | 0.101 | 6
