
Counterfactual Off-Policy Evaluation with Gumbel-Max Structural Causal Models

About

We introduce an off-policy evaluation procedure for highlighting episodes where applying a reinforcement learned (RL) policy is likely to have produced a substantially different outcome than the observed policy. In particular, we introduce a class of structural causal models (SCMs) for generating counterfactual trajectories in finite partially observable Markov Decision Processes (POMDPs). We see this as a useful procedure for off-policy "debugging" in high-risk settings (e.g., healthcare); by decomposing the expected difference in reward between the RL and observed policy into specific episodes, we can identify episodes where the counterfactual difference in reward is most dramatic. This in turn can be used to facilitate review of specific episodes by domain experts. We demonstrate the utility of this procedure with a synthetic environment of sepsis management.
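The abstract does not spell out how a counterfactual draw is generated, so the following is only a minimal sketch of the generic Gumbel-Max construction named in the title, not the authors' implementation. It assumes a single categorical variable (for example, the next state) with observed distribution `p` and counterfactual distribution `q`; the function names `sample_gumbel`, `posterior_gumbels`, and `counterfactual_outcome` are hypothetical. The idea is to sample Gumbel noise consistent with the observed outcome, then reuse that noise under `q`.

```python
import math
import random


def sample_gumbel(rng, loc=0.0):
    """Draw one Gumbel(loc, 1) sample by inverse-CDF sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return loc - math.log(-math.log(u))


def posterior_gumbels(p, observed, rng):
    """Sample noise g such that argmax_j(log p[j] + g[j]) == observed.

    Assumption: p is a normalized categorical distribution, so the maximum of
    the shifted Gumbels is itself Gumbel(0, 1) and independent of the argmax.
    """
    if p[observed] <= 0.0:
        raise ValueError("observed outcome has zero probability under p")
    top = sample_gumbel(rng)  # value of the winning (observed) category
    g = []
    for j, pj in enumerate(p):
        if j == observed:
            g.append(top - math.log(pj))
        elif pj == 0.0:
            # The observation puts no constraint on this noise term.
            g.append(sample_gumbel(rng))
        else:
            # Gumbel(log pj) truncated to lie below the observed maximum.
            free = sample_gumbel(rng, loc=math.log(pj))
            truncated = -math.log(math.exp(-free) + math.exp(-top))
            g.append(truncated - math.log(pj))
    return g


def counterfactual_outcome(p, q, observed, rng):
    """Reuse the posterior noise under the counterfactual distribution q."""
    g = posterior_gumbels(p, observed, rng)
    scores = [
        math.log(qj) + gj if qj > 0.0 else -math.inf
        for qj, gj in zip(q, g)
    ]
    return max(range(len(q)), key=scores.__getitem__)


rng = random.Random(0)
p = [0.2, 0.5, 0.3]  # illustrative probabilities, not from the paper
q = [0.6, 0.1, 0.3]
draws = [counterfactual_outcome(p, q, observed=1, rng=rng) for _ in range(5)]
```

A useful sanity check on this sketch is that when `q` equals `p`, the counterfactual outcome always matches the observed one, since the posterior noise was constructed to make the observed category the argmax.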

Michael Oberst, David Sontag • 2019

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Counterfactual Policy Evaluation | GridWorld p = 0.4 | Worst-Case Counterfactual V(s0) | 230 | 2 |
| Counterfactual Policy Evaluation | Aircraft | Avg Worst-Case V(s0) | 221 | 2 |
| Counterfactual Policy Evaluation | Sepsis Slightly Suboptimal Path | Lowest Cumulative Reward | 8.00e+3 | 2 |
| Counterfactual Policy Evaluation | Sepsis Catastrophic Path | Lowest Cumulative Reward | -9.05e+3 | 2 |
| Counterfactual Policy Evaluation | GridWorld p = 0.9 | Avg Worst-Case Counterfactual V(s0) | 304 | 2 |
| Counterfactual Policy Evaluation | Sepsis | Average Worst-Case Counterfactual V(s0) | 85.4 | 2 |
| Counterfactual Policy Evaluation | Frozen Lake | Average Worst-Case V(s0) | 2.56 | 2 |
| Counterfactual Policy Evaluation | GridWorld (p = 0.9) Slightly Suboptimal Path | Lowest Cumulative Reward | -495 | 2 |
| Counterfactual Policy Evaluation | GridWorld (p = 0.9) - Almost Catastrophic | Cumulative Reward (Lowest) | -698 | 2 |
| Counterfactual Policy Evaluation | GridWorld (p = 0.9) Catastrophic Path | Lowest Cumulative Reward | -698 | 2 |
Showing 10 of 25 rows
