
Doubly robust off-policy evaluation with shrinkage

About

We propose a new framework for designing estimators for off-policy evaluation in contextual bandits. Our approach is based on the asymptotically optimal doubly robust estimator, but we shrink the importance weights to minimize a bound on the mean squared error, which results in a better bias-variance tradeoff in finite samples. We use this optimization-based framework to obtain three estimators: (a) a weight-clipping estimator, (b) a new weight-shrinkage estimator, and (c) the first shrinkage-based estimator for combinatorial action sets. Extensive experiments in both standard and combinatorial bandit benchmark problems show that our estimators are highly adaptive and typically outperform state-of-the-art methods.
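To make the idea concrete, here is a minimal sketch of the weight-clipping variant (estimator (a) above) of the doubly robust estimator. The function and argument names are illustrative, not from the paper, and a fixed clipping threshold is used for simplicity; the paper's actual estimators instead choose the shrinkage level by minimizing a bound on the mean squared error.

```python
import numpy as np

def dr_clipped_estimate(rewards, behavior_probs, target_probs,
                        reward_preds, target_pred_values, clip=10.0):
    """Doubly robust off-policy value estimate with clipped importance weights.

    rewards            -- observed rewards r_i for the logged actions
    behavior_probs     -- mu(a_i | x_i), logging-policy propensities
    target_probs       -- pi(a_i | x_i), target-policy probabilities of the logged actions
    reward_preds       -- reward-model predictions for the logged (x_i, a_i) pairs
    target_pred_values -- reward-model estimate of the target policy's value at each x_i
    clip               -- threshold at which importance weights are truncated
    """
    # Shrink the importance weights pi/mu by clipping them at a fixed threshold,
    # trading a little bias for a potentially large variance reduction.
    w = np.minimum(target_probs / behavior_probs, clip)
    # DR estimate: model-based baseline plus an importance-weighted residual correction.
    return float(np.mean(target_pred_values + w * (rewards - reward_preds)))
```

When the reward model is accurate, the residual term vanishes and the estimate falls back on the model; when the model is poor, the (clipped) importance-weighted correction keeps the estimate close to unbiased wherever the weights are below the threshold.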

Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, Miroslav Dudík • 2019

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|------|---------|--------|--------|------|
| Off-Policy Learning | Wiki10-31K Synthetic tau=0.5 (test) | P@5 | 0.5526 | 14 |
| Off-Policy Learning | Wiki10-31K Synthetic tau=1 (test) | P@5 | 54.99 | 14 |
| Off-Policy Learning | Wiki10-31K Synthetic tau=2 (test) | P@5 | 0.5347 | 14 |
| Recommendation | Yahoo! R3 (test) | P@5 | 28.43 | 13 |
| Recommendation | Coat (test) | Precision@5 | 0.279 | 13 |
| Recommendation | KuaiRec (test) | Precision@50 | 87.44 | 13 |
| Off-policy Evaluation | Digits (UCI) | MSE | 0.0384 | 12 |
| Off-policy Evaluation | PenDigits (UCI) | MSE | 0.0138 | 6 |
| Off-policy Evaluation | SatImage (UCI) | MSE | 0.0078 | 6 |
| Off-policy Evaluation | Letter (UCI) | MSE | 0.2363 | 6 |

Showing 10 of 16 rows
