
Joint Treatment Effect Estimation from Incomplete Healthcare Data: Temporal Causal Normalizing Flows with LLM-driven Evolutionary MNAR Imputation

About

Target trial emulation (TTE) enables causal questions to be studied with observational data when randomized controlled trials (RCTs) are infeasible. Yet treatment-effect methods often address causal estimation, missingness, and temporal structure separately, limiting their robustness in electronic health records (EHRs), where confounding varies over time and missing-not-at-random (MNAR) biomarker missingness can reach 50% to 80%. We propose a two-stage pipeline for treatment effect estimation from incomplete longitudinal EHRs. First, CausalFlow-T, a directed acyclic graph (DAG)-constrained normalizing flow with long short-term memory (LSTM)-encoded patient history, performs exact invertible counterfactual inference, avoiding approximation errors from variational inference and separating confounding through explicit causal structure. Ablations on four synthetic benchmarks and one semi-synthetic benchmark with known counterfactuals show that DAG constraints and exact inference address distinct failure modes: neither compensates for the other. Second, because CausalFlow-T requires completed inputs, we introduce an LLM-driven evolutionary imputer that proposes executable imputation operators rather than individual entries, and evaluate it with three large language model (LLM) backends, including two open-source models. Across 30% to 80% MNAR missingness, this imputer achieves the best pooled rank over biomarker and causal metrics, leading in point-wise accuracy and temporal extrapolation while preserving average treatment effect (ATE) recovery as statistical baselines degrade. On Swiss primary-care EHRs from adults with type 2 diabetes initiating a GLP-1 receptor agonist or SGLT-2 inhibitor, the pipeline estimates a per-protocol weight-loss difference of -0.98 kg [95% CI -1.01, -0.96] favoring GLP-1 receptor agonists, consistent with randomized evidence and obtained from realistically incomplete real-world EHRs.
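To illustrate why MNAR missingness is hard for simple statistical baselines, here is a minimal toy sketch. It is not the paper's benchmark generator or imputer. The function name `simulate_mnar`, the logistic missingness mechanism, and the `slope` parameter are all assumptions chosen for illustration. Each value is hidden with a probability that rises with the value itself, so the observed values are a biased sample of the full data.

```python
import math
import random
import statistics


def simulate_mnar(n=10_000, slope=2.0, seed=0):
    """Draw standard-normal values and an MNAR missingness mask.

    Hypothetical mechanism (an assumption, not taken from the paper):
    the probability that a value is missing is a logistic function of
    the value itself, so larger values are hidden more often.
    """
    rng = random.Random(seed)
    values = [rng.gauss(0.0, 1.0) for _ in range(n)]
    missing = []
    for v in values:
        p_missing = 1.0 / (1.0 + math.exp(-slope * v))
        missing.append(rng.random() < p_missing)
    return values, missing


values, missing = simulate_mnar()
observed = [v for v, m in zip(values, missing) if not m]

true_mean = statistics.fmean(values)        # mean of the complete data
observed_mean = statistics.fmean(observed)  # what mean imputation would fill in
missing_rate = sum(missing) / len(missing)

print(f"missing rate:  {missing_rate:.2f}")
print(f"true mean:     {true_mean:.3f}")
print(f"observed mean: {observed_mean:.3f}")
```

Under this toy mechanism the observed mean falls well below the true mean, so filling gaps with the observed mean would shift the completed data downward. Any treatment-effect estimate computed from that data would inherit the shift.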

Olivia Jullian Parra, Sara Zoccheddu, David Catalan Cerezo, Tom Forzy, Franziska Ulrich, William Sutcliffe, Jakob Martin Burgstaller, Oliver Senn, Patrick Owen, Nicola Serra • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Biomarker-level imputation | Semi-synthetic biomarkers 30% missingness | MAE | 1.914 | 7 |
| Biomarker-level imputation | Semi-synthetic biomarkers 50% missingness | MAE | 2.052 | 7 |
| Biomarker-level imputation | Semi-synthetic biomarkers 80% missingness | MAE | 2.059 | 7 |
| Imputation | Semi-synthetic EHR dataset MNAR 30% (test) | MAE | 1 | 6 |
| Imputation | Semi-synthetic EHR dataset MNAR 50% (test) | MAE | 1 | 6 |
| Imputation | Semi-synthetic EHR dataset MNAR 80% (test) | MAE | 1 | 6 |
| Imputation | Semi-synthetic EHR dataset Pooled 30-80% MNAR (summary) | Mean Rank | 2.38 | 6 |
| Causal Estimation | CVD Risk synthetic dataset | MAE (rank) | 1 | 5 |
| Reliability Assessment | Aggregate Synthetic Datasets | Mean Rank | 1.83 | 5 |
| Causal Estimation | LDL synthetic dataset | MAE (rank) | 2 | 5 |
Showing 10 of 12 rows
