
Doubly robust off-policy evaluation with shrinkage

About

We propose a new framework for designing estimators for off-policy evaluation in contextual bandits. Our approach is based on the asymptotically optimal doubly robust estimator, but we shrink the importance weights to minimize a bound on the mean squared error, which results in a better bias-variance tradeoff in finite samples. We use this optimization-based framework to obtain three estimators: (a) a weight-clipping estimator, (b) a new weight-shrinkage estimator, and (c) the first shrinkage-based estimator for combinatorial action sets. Extensive experiments in both standard and combinatorial bandit benchmark problems show that our estimators are highly adaptive and typically outperform state-of-the-art methods.
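To make the idea concrete, here is a minimal sketch of the weight-clipping variant (estimator (a) above) of the doubly robust estimator. The function and argument names are illustrative, not from the paper, and a fixed clipping threshold is used for simplicity; the paper's actual estimators instead choose the shrinkage level by minimizing a bound on the mean squared error.

```python
import numpy as np

def dr_clipped_estimate(rewards, behavior_probs, target_probs,
                        reward_preds, target_pred_values, clip=10.0):
    """Doubly robust off-policy value estimate with clipped importance weights.

    rewards            -- observed rewards r_i for the logged actions
    behavior_probs     -- mu(a_i | x_i), logging-policy propensities
    target_probs       -- pi(a_i | x_i), target-policy probabilities of the logged actions
    reward_preds       -- reward-model predictions for the logged (x_i, a_i) pairs
    target_pred_values -- reward-model estimate of the target policy's value at each x_i
    clip               -- threshold at which importance weights are truncated
    """
    # Shrink the importance weights pi/mu by clipping them at a fixed threshold,
    # trading a little bias for a potentially large variance reduction.
    w = np.minimum(target_probs / behavior_probs, clip)
    # DR estimate: model-based baseline plus an importance-weighted residual correction.
    return float(np.mean(target_pred_values + w * (rewards - reward_preds)))
```

When the reward model is accurate, the residual term vanishes and the estimate falls back on the model; when the model is poor, the (clipped) importance-weighted correction keeps the estimate close to unbiased wherever the weights are below the threshold.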

Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, Miroslav Dudík • 2019

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|------|---------|--------|--------|------|
| Off-Policy Learning | Wiki10-31K Synthetic tau=0.5 (test) | P@5 | 0.5526 | 14 |
| Off-Policy Learning | Wiki10-31K Synthetic tau=1 (test) | P@5 | 54.99 | 14 |
| Off-Policy Learning | Wiki10-31K Synthetic tau=2 (test) | P@5 | 0.5347 | 14 |
| Recommendation | Yahoo! R3 (test) | P@5 | 28.43 | 13 |
| Recommendation | Coat (test) | Precision@5 | 0.279 | 13 |
| Recommendation | KuaiRec (test) | Precision@50 | 87.44 | 13 |
| Off-policy Evaluation | Digits (UCI) | MSE | 0.0384 | 12 |
| Off-policy Evaluation | PenDigits (UCI) | MSE | 0.0138 | 6 |
| Off-policy Evaluation | SatImage (UCI) | MSE | 0.0078 | 6 |
| Off-policy Evaluation | Letter (UCI) | MSE | 0.2363 | 6 |

Showing 10 of 16 rows
