Bootstrapping Fitted Q-Evaluation for Off-Policy Inference
About
Bootstrapping provides a flexible and effective approach for assessing the quality of batch reinforcement learning, yet its theoretical properties are less well understood. In this paper, we study the use of bootstrapping in off-policy evaluation (OPE), focusing on fitted Q-evaluation (FQE), which is known to be minimax-optimal in the tabular and linear-model cases. We propose a bootstrapping FQE method for inferring the distribution of the policy evaluation error and show that this method is asymptotically efficient and distributionally consistent for off-policy statistical inference. To overcome the computational cost of bootstrapping, we further adapt a subsampling procedure that improves the runtime by an order of magnitude. We numerically evaluate the bootstrapping method in classical RL environments on three tasks: confidence interval estimation, estimating the variance of an off-policy evaluator, and estimating the correlation between multiple off-policy evaluators.
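To make the idea in the abstract concrete, the following is a minimal sketch of bootstrapped FQE in the tabular case. It is not the authors' implementation. The function names (`fqe_tabular`, `bootstrap_fqe_ci`, `simulate_episodes`), the toy random MDP, the episode-level resampling, and the choice of a basic (error-quantile) bootstrap interval are all assumptions made for illustration. The subsampling speed-up mentioned in the abstract is not shown.

```python
import numpy as np

def fqe_tabular(episodes, policy, n_states, n_actions, gamma=0.9, n_iters=50):
    """Tabular FQE: repeatedly regress r + gamma * V_pi(s') onto (s, a).
    In the tabular case the least-squares fit is the per-cell mean target."""
    transitions = [t for ep in episodes for t in ep]
    s, a, r, s2 = (np.array(col) for col in zip(*transitions))
    s, a, s2 = s.astype(int), a.astype(int), s2.astype(int)
    counts = np.zeros((n_states, n_actions))
    np.add.at(counts, (s, a), 1)
    q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        v_next = (policy[s2] * q[s2]).sum(axis=1)   # V_pi(s') under current Q
        sums = np.zeros_like(q)
        np.add.at(sums, (s, a), r + gamma * v_next)
        # Unvisited (s, a) cells are left at 0 (an arbitrary choice here).
        q = np.divide(sums, counts, out=np.zeros_like(q), where=counts > 0)
    return q

def bootstrap_fqe_ci(episodes, policy, n_states, n_actions, s0=0,
                     gamma=0.9, n_boot=200, alpha=0.05, seed=0):
    """Resample whole episodes with replacement, rerun FQE, and use the
    bootstrap error distribution to form a basic bootstrap interval."""
    rng = np.random.default_rng(seed)

    def value(eps):
        q = fqe_tabular(eps, policy, n_states, n_actions, gamma)
        return float(policy[s0] @ q[s0])            # V_pi(s0)

    v_hat = value(episodes)
    n = len(episodes)
    errors = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        errors[b] = value([episodes[i] for i in idx]) - v_hat
    lo = v_hat - np.quantile(errors, 1 - alpha / 2)
    hi = v_hat - np.quantile(errors, alpha / 2)
    return v_hat, lo, hi

def simulate_episodes(rng, n_episodes, horizon, P, R, behavior):
    """Hypothetical data generator: roll out a behavior policy in a toy MDP."""
    n_states, n_actions = R.shape
    episodes = []
    for _ in range(n_episodes):
        s, ep = 0, []
        for _ in range(horizon):
            a = rng.choice(n_actions, p=behavior[s])
            s2 = rng.choice(n_states, p=P[s, a])
            ep.append((s, a, R[s, a], s2))
            s = s2
        episodes.append(ep)
    return episodes

# Toy usage: 3 states, 2 actions, uniform behavior policy, skewed target policy.
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(3), size=(3, 2))          # shape (3, 2, 3)
R = rng.uniform(size=(3, 2))
behavior = np.full((3, 2), 0.5)
target = np.tile([0.8, 0.2], (3, 1))
data = simulate_episodes(rng, n_episodes=50, horizon=20, P=P, R=R, behavior=behavior)
v_hat, lo, hi = bootstrap_fqe_ci(data, target, 3, 2)
print(f"V_pi(s0) estimate {v_hat:.3f}, 95% interval [{lo:.3f}, {hi:.3f}]")
```

Resampling whole episodes rather than individual transitions keeps the within-episode dependence intact. The empirical variance of the bootstrap estimates, or the correlation between two evaluators computed on the same resamples, could be read off the same loop.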
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Empirical Coverage Estimation | RiverSwim | Q^π(1, 0) | 0.932 | 120 |
| Empirical Coverage Estimation | RiverSwim episode length T = 10 (nominal 95% coverage) | Q* (1, 0) | 86.2 | 20 |
| Empirical Coverage Estimation | RiverSwim T=50 90% nominal coverage | Q* (1, 0) | 84.3 | 20 |
| Optimal Policy Recovery (Empirical Coverage) | RiverSwim T=50 nominal 95% coverage | Q* Recovery (s=1, a=0) | 90.5 | 20 |
| Action-Value coverage estimation | RiverSwim mostly-right target policy T=50 | Q-Value Estimate (s=1, a=0) | 0.474 | 20 |
| State-Value coverage estimation | RiverSwim mostly-right target policy T=50 | V(s=1) | 0.474 | 20 |
| Off-policy Evaluation | RiverSwim mostly-left policy, T=50 | Qπ(1, 0) Coverage | 50 | 20 |
| State Value Estimation Coverage | RiverSwim | Value Estimate State 1 | 0.931 | 20 |
| State-Action Value Estimation Coverage | RiverSwim | Q-Value Estimate (s=1, a=0) | 0.931 | 20 |
| Action-Value coverage estimation | RiverSwim T=100 | Q*(1,0) | 0.854 | 15 |