Model-based Bootstrap of Controlled Markov Chains

About

We propose and analyze a model-based bootstrap for transition kernels in finite controlled Markov chains (CMCs) with possibly nonstationary or history-dependent control policies, a setting that arises naturally in offline reinforcement learning (RL) when the behavior policy generating the data is unknown. We establish distributional consistency of the bootstrap transition estimator in both a single long-chain regime and the episodic offline RL regime. The key technical tools are a novel bootstrap law of large numbers (LLN) for the visitation counts and a novel use of the martingale central limit theorem (CLT) for the bootstrap transition increments. We extend bootstrap distributional consistency to the downstream targets of offline policy evaluation (OPE) and optimal policy recovery (OPR) via the delta method by verifying Hadamard differentiability of the Bellman operators, yielding asymptotically valid confidence intervals for value and $Q$-functions. Experiments on the RiverSwim problem show that the proposed bootstrap confidence intervals (CIs), especially the percentile CIs, outperform the episodic bootstrap and plug-in CLT CIs, and are often close to nominal ($50\%$, $90\%$, $95\%$) coverage, while the baselines are poorly calibrated at small sample sizes and short episode lengths.
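To make the idea of a percentile bootstrap CI for a $Q$-function concrete, here is a minimal Python sketch. It is not the authors' procedure: it assumes a simplified scheme that holds the observed visitation counts fixed and redraws only the next-state counts from the estimated kernel, whereas the abstract describes a bootstrap of the chain itself. The function names (`estimate_kernel`, `q_pi`, `bootstrap_q_ci`) and the two-state toy data are hypothetical and chosen only for illustration.

```python
import numpy as np

def estimate_kernel(counts):
    """Row-normalize counts[s, a, s'] into a kernel; unvisited (s, a) rows become uniform."""
    totals = counts.sum(axis=2, keepdims=True)
    n_states = counts.shape[2]
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)

def q_pi(P, R, pi, gamma):
    """Solve the Bellman evaluation equations for Q^pi.

    P: (S, A, S) kernel, R: (S, A) rewards, pi: (S, A) target policy probabilities.
    """
    S, A, _ = P.shape
    # Chain on state-action pairs: (s, a) -> (s', a') with prob P[s, a, s'] * pi[s', a'].
    P_sa = (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
    q = np.linalg.solve(np.eye(S * A) - gamma * P_sa, R.reshape(S * A))
    return q.reshape(S, A)

def bootstrap_q_ci(counts, R, pi, gamma, level=0.95, n_boot=500, seed=0):
    """Percentile CI for Q^pi, resampling next-state counts from the estimated kernel.

    Assumption: visitation totals per (s, a) are held fixed across replicates.
    """
    rng = np.random.default_rng(seed)
    P_hat = estimate_kernel(counts)
    S, A, _ = counts.shape
    totals = counts.sum(axis=2).astype(int)
    q_boot = np.empty((n_boot, S, A))
    for b in range(n_boot):
        boot_counts = np.zeros((S, A, S))
        for s in range(S):
            for a in range(A):
                if totals[s, a] > 0:
                    boot_counts[s, a] = rng.multinomial(totals[s, a], P_hat[s, a])
        q_boot[b] = q_pi(estimate_kernel(boot_counts), R, pi, gamma)
    alpha = 1.0 - level
    lo = np.quantile(q_boot, alpha / 2, axis=0)
    hi = np.quantile(q_boot, 1 - alpha / 2, axis=0)
    return q_pi(P_hat, R, pi, gamma), lo, hi

# Hypothetical toy data: 2 states, 2 actions, made-up transition counts.
counts = np.array([[[30, 10], [5, 35]],
                   [[20, 20], [8, 32]]])
R = np.array([[0.0, 0.1],
              [0.0, 1.0]])
pi = np.array([[0.2, 0.8],
               [0.2, 0.8]])
q_hat, lo, hi = bootstrap_q_ci(counts, R, pi, gamma=0.9)
print("Q_hat(0, 0):", q_hat[0, 0], "95% percentile CI:", (lo[0, 0], hi[0, 0]))
```

Holding the counts fixed keeps the sketch short, but it ignores the randomness in the visitation counts, which the abstract handles through a bootstrap law of large numbers.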

Ziwei Su, Imon Banerjee, Diego Klabjan • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Empirical Coverage Estimation | RiverSwim | Q^π(1, 0) | 0.947 | 120 |
| Action-Value coverage estimation | RiverSwim mostly-right target policy T=50 | Q-Value Estimate (s=1, a=0) | 0.523 | 20 |
| Empirical Coverage Estimation | RiverSwim episode length T = 10 (nominal 95% coverage) | Q* (1, 0) | 89.9 | 20 |
| Empirical Coverage Estimation | RiverSwim T=50 90% nominal coverage | Q* (1, 0) | 91.3 | 20 |
| Off-policy Evaluation | RiverSwim mostly-left policy, T=50 | Qπ(1, 0) Coverage | 55.7 | 20 |
| Optimal Policy Recovery (Empirical Coverage) | RiverSwim T=50 nominal 95% coverage | Q* Recovery (s=1, a=0) | 95.7 | 20 |
| State Value Estimation Coverage | RiverSwim | Value Estimate State 1 | 0.952 | 20 |
| State-Action Value Estimation Coverage | RiverSwim | Q-Value Estimate (s=1, a=0) | 0.952 | 20 |
| State-Value coverage estimation | RiverSwim mostly-right target policy T=50 | V(s=1) | 0.523 | 20 |
| Action-Value coverage estimation | RiverSwim T=100 | Q*(1,0) | 0.907 | 15 |
Showing 10 of 13 rows
