Counterfactual Off-Policy Evaluation with Gumbel-Max Structural Causal Models
About
We introduce an off-policy evaluation procedure for highlighting episodes where applying a policy learned by reinforcement learning (RL) would likely have produced a substantially different outcome than the one obtained under the observed policy. In particular, we introduce a class of structural causal models (SCMs) for generating counterfactual trajectories in finite partially observable Markov Decision Processes (POMDPs). We see this as a useful procedure for off-policy "debugging" in high-risk settings (e.g., healthcare): by decomposing the expected difference in reward between the RL policy and the observed policy into contributions from individual episodes, we can identify the episodes where the counterfactual difference in reward is largest. These episodes can then be flagged for review by domain experts. We demonstrate the utility of this procedure in a synthetic sepsis-management environment.
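To make the idea of a Gumbel-Max SCM concrete, here is a minimal sketch for a single categorical variable (for example, one state transition). It is an illustration, not the authors' implementation. The function names are hypothetical, the observed probabilities are assumed to be strictly positive, and the posterior noise is drawn with the standard top-down (truncated Gumbel) construction. The observed outcome is treated as the argmax of log-probabilities plus Gumbel noise; the noise is then resampled conditional on that outcome and reused under a different (counterfactual) distribution.

```python
import math
import random


def sample_gumbel(rng, loc=0.0):
    """Draw one sample from Gumbel(loc, 1) by inverting the CDF."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
        u = rng.random()
    return loc - math.log(-math.log(u))


def gumbel_max_sample(probs, rng):
    """Sample a category index as argmax_i (log p_i + g_i), g_i ~ Gumbel(0, 1)."""
    scores = [math.log(p) + sample_gumbel(rng) for p in probs]
    return max(range(len(scores)), key=scores.__getitem__)


def posterior_noise(probs, observed, rng):
    """Sample Gumbel noise consistent with `observed` being the argmax.

    Assumes `probs` is normalized and strictly positive.
    """
    # The maximum perturbed logit is Gumbel(log sum_i p_i) = Gumbel(0).
    top = sample_gumbel(rng)
    noise = []
    for i, p in enumerate(probs):
        logit = math.log(p)
        if i == observed:
            value = top
        else:
            # Gumbel(logit) truncated to lie below `top`.
            g = sample_gumbel(rng, loc=logit)
            value = -math.log(math.exp(-g) + math.exp(-top))
        noise.append(value - logit)  # recover the standard Gumbel noise term
    return noise


def counterfactual_outcome(probs, observed, cf_probs, rng):
    """Replay the posterior noise under the counterfactual distribution."""
    noise = posterior_noise(probs, observed, rng)
    scores = [
        (math.log(q) if q > 0 else -math.inf) + n
        for q, n in zip(cf_probs, noise)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    rng = random.Random(0)
    observed_probs = [0.2, 0.5, 0.3]  # distribution under the observed policy
    cf_probs = [0.6, 0.1, 0.3]        # distribution under the alternative policy
    k = gumbel_max_sample(observed_probs, rng)
    print("observed:", k)
    print("counterfactual:", counterfactual_outcome(observed_probs, k, cf_probs, rng))
```

When the counterfactual distribution equals the observed one, this construction returns the observed outcome, which is the consistency one expects of a counterfactual. A full trajectory would apply the same step at every transition of an episode.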
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Counterfactual Policy Evaluation | GridWorld p = 0.4 | Worst-Case Counterfactual V(s0) | 230 | 2 |
| Counterfactual Policy Evaluation | Aircraft | Avg Worst-Case V(s0) | 221 | 2 |
| Counterfactual Policy Evaluation | Sepsis Slightly Suboptimal Path | Lowest Cumulative Reward | 8.00e+3 | 2 |
| Counterfactual Policy Evaluation | Sepsis Catastrophic Path | Lowest Cumulative Reward | -9.05e+3 | 2 |
| Counterfactual Policy Evaluation | GridWorld p = 0.9 | Avg Worst-Case Counterfactual V(s0) | 304 | 2 |
| Counterfactual Policy Evaluation | Sepsis | Average Worst-Case Counterfactual V(s0) | 85.4 | 2 |
| Counterfactual Policy Evaluation | Frozen Lake | Average Worst-Case V(s0) | 2.56 | 2 |
| Counterfactual Policy Evaluation | GridWorld (p = 0.9) Slightly Suboptimal Path | Lowest Cumulative Reward | -495 | 2 |
| Counterfactual Policy Evaluation | GridWorld (p = 0.9) - Almost Catastrophic | Cumulative Reward (Lowest) | -698 | 2 |
| Counterfactual Policy Evaluation | GridWorld (p = 0.9) Catastrophic Path | Lowest Cumulative Reward | -698 | 2 |