
Robust Counterfactual Inference in Markov Decision Processes

About

This paper addresses a key limitation in existing counterfactual inference methods for Markov Decision Processes (MDPs). Current approaches assume a specific causal model to make counterfactuals identifiable. However, there are usually many causal models that align with the observational and interventional distributions of an MDP, each yielding different counterfactual distributions, so fixing a particular causal model limits the validity (and usefulness) of counterfactual inference. We propose a novel non-parametric approach that computes tight bounds on counterfactual transition probabilities across all compatible causal models. Unlike previous methods that require solving prohibitively large optimisation problems (with variables that grow exponentially in the size of the MDP), our approach provides closed-form expressions for these bounds, making computation highly efficient and scalable for non-trivial MDPs. Once such an interval counterfactual MDP is constructed, our method identifies robust counterfactual policies that optimise the worst-case reward w.r.t. the uncertain interval MDP probabilities. We evaluate our method on various case studies, demonstrating improved robustness over existing methods.

Jessica Lally, Milad Kazemi, Nicola Paoletti • 2025
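The pipeline the abstract describes — interval bounds on counterfactual transition probabilities, then robust policy evaluation on the resulting interval MDP — can be sketched generically. The sketch below is an illustration, not the paper's method: the tight closed-form bounds the authors derive are not reproduced here; instead, valid but generally looser Fréchet-inequality bounds are used, and the function names (`counterfactual_interval`, `worst_case_dist`, `robust_value_iteration`) are hypothetical.

```python
import numpy as np

def counterfactual_interval(p_a, p_ap, s_obs):
    """Interval bounds on the counterfactual transition distribution
    P(S'_{a'} = t | s, a, S' = s_obs), via the Frechet inequalities
    max(0, p + q - 1) <= P(A and B) <= min(p, q).

    p_a[t]  = P(S'=t | s, a)   (action actually taken)
    p_ap[t] = P(S'=t | s, a')  (counterfactual action)
    NOTE: looser than the paper's tight closed-form bounds.
    """
    p_evt = p_a[s_obs]  # probability of the observed successor; assumed > 0
    lo = np.maximum(0.0, p_ap + p_evt - 1.0) / p_evt
    hi = np.minimum(p_ap, p_evt) / p_evt
    return lo, hi

def worst_case_dist(lo, hi, V):
    """Adversarial distribution within [lo, hi] minimising p . V:
    start from the lower bounds and greedily push the remaining
    mass onto the lowest-value successor states first."""
    p = lo.copy()
    slack = 1.0 - p.sum()
    for i in np.argsort(V):
        add = min(hi[i] - lo[i], slack)
        p[i] += add
        slack -= add
    return p

def robust_value_iteration(lo, hi, R, gamma=0.95, tol=1e-8):
    """Pessimistic value iteration on an interval MDP.

    lo, hi : interval transition bounds, shape [A, S, S]
    R      : state rewards, shape [S]
    Returns the worst-case optimal value function V, shape [S].
    """
    A, S, _ = lo.shape
    V = np.zeros(S)
    while True:
        Q = np.empty((A, S))
        for a in range(A):
            for s in range(S):
                p = worst_case_dist(lo[a, s], hi[a, s], V)
                Q[a, s] = R[s] + gamma * p @ V
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Tiny two-state example: bound the counterfactual transition from s = 0.
p_a  = np.array([0.6, 0.4])   # P(S'|s, a)  for the observed action
p_ap = np.array([0.3, 0.7])   # P(S'|s, a') for the alternative action
lo, hi = counterfactual_interval(p_a, p_ap, s_obs=0)
# lo = [0.0, 0.5], hi = [0.5, 1.0]
```

The greedy inner step in `worst_case_dist` is the standard order-statistics solution to the adversary's linear program over an interval ambiguity set; when `lo == hi` the intervals are degenerate and robust value iteration reduces to ordinary value iteration.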

Related benchmarks

Task: Counterfactual Policy Evaluation (all rows)

Dataset                                       | Metric                                  | Result  | Rank
GridWorld (p = 0.9)                           | Average Worst-Case Counterfactual V(s0) | 346     | 2
Sepsis                                        | Average Worst-Case Counterfactual V(s0) | 1.66e+3 | 2
Frozen Lake                                   | Average Worst-Case V(s0)                | 37.3    | 2
GridWorld (p = 0.9), Slightly Suboptimal Path | Lowest Cumulative Reward                | -495    | 2
GridWorld (p = 0.9), Almost Catastrophic      | Lowest Cumulative Reward                | -697    | 2
GridWorld (p = 0.9), Catastrophic Path        | Lowest Cumulative Reward                | -698    | 2
GridWorld (p = 0.4), Slightly Suboptimal Path | Lowest Cumulative Reward                | 19      | 2
GridWorld (p = 0.4), Almost Catastrophic      | Lowest Cumulative Reward                | 14      | 2
GridWorld (p = 0.4), Catastrophic Path        | Lowest Cumulative Reward                | -698    | 2
Sepsis, Almost Catastrophic                   | Lowest Cumulative Reward                | 100     | 2

Showing 10 of 25 rows.
