Joint Treatment Effect Estimation from Incomplete Healthcare Data: Temporal Causal Normalizing Flows with LLM-driven Evolutionary MNAR Imputation

About

Target trial emulation (TTE) enables causal questions to be studied with observational data when randomized controlled trials (RCTs) are infeasible. Yet treatment-effect methods often address causal estimation, missingness, and temporal structure separately, which limits their robustness in electronic health records (EHRs), where confounding varies over time and missing-not-at-random (MNAR) biomarker missingness can reach 50% to 80%. We propose a two-stage pipeline for treatment effect estimation from incomplete longitudinal EHRs. First, CausalFlow-T, a directed acyclic graph (DAG)-constrained normalizing flow with long short-term memory (LSTM)-encoded patient history, performs exact invertible counterfactual inference, avoiding approximation errors from variational inference and separating confounding through explicit causal structure. Ablations on four synthetic benchmarks and one semi-synthetic benchmark with known counterfactuals show that DAG constraints and exact inference address distinct failure modes: neither compensates for the other. Second, because CausalFlow-T requires completed inputs, we introduce an LLM-driven evolutionary imputer that proposes executable imputation operators rather than individual entries, and evaluate it with three large language model (LLM) backends, including two open-source models. Across 30% to 80% MNAR missingness, this imputer achieves the best pooled rank over biomarker and causal metrics, leading in point-wise accuracy and temporal extrapolation while preserving average treatment effect (ATE) recovery as statistical baselines degrade. On Swiss primary-care EHRs from adults with type 2 diabetes initiating a GLP-1 receptor agonist or SGLT-2 inhibitor, the pipeline estimates a per-protocol weight-loss difference of -0.98 kg [95% CI -1.01, -0.96] favoring GLP-1 receptor agonists, consistent with randomized evidence and obtained from realistically incomplete real-world EHRs.
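To make "exact invertible counterfactual inference" under a DAG constraint concrete, here is a minimal sketch on a toy graph. It is not CausalFlow-T: the three-variable DAG, the fixed affine mechanisms, and every name (`PARENTS`, `PARAMS`, `forward`, `inverse`, `counterfactual`) are assumptions made for illustration, and the learned flow layers and LSTM history encoder are omitted. The point is only that when each variable is an invertible function of its parents and its own noise, the noise can be recovered exactly from an observation and replayed under an intervention.

```python
# Hypothetical toy DAG: X -> T, X -> Y, T -> Y (topological order X, T, Y).
PARENTS = {"X": [], "T": ["X"], "Y": ["X", "T"]}
ORDER = ["X", "T", "Y"]

# Illustrative affine mechanisms: value = bias + sum(w * parent) + scale * noise.
# These numbers are made up; a real model would learn them from data.
PARAMS = {
    "X": {"bias": 0.0, "w": {}, "scale": 1.0},
    "T": {"bias": 0.5, "w": {"X": 0.8}, "scale": 1.0},
    "Y": {"bias": 1.0, "w": {"X": 0.3, "T": -2.0}, "scale": 0.5},
}


def _shift(name, values):
    """Part of a variable that is determined by its DAG parents only."""
    p = PARAMS[name]
    return p["bias"] + sum(p["w"][pa] * values[pa] for pa in PARENTS[name])


def forward(noise, do=None):
    """Map exogenous noise to observables, optionally under an intervention."""
    do = do or {}
    values = {}
    for name in ORDER:
        if name in do:
            # Intervened variables ignore their parents and their noise.
            values[name] = do[name]
        else:
            values[name] = _shift(name, values) + PARAMS[name]["scale"] * noise[name]
    return values


def inverse(values):
    """Abduction: recover exactly the noise that produced an observation."""
    return {n: (values[n] - _shift(n, values)) / PARAMS[n]["scale"] for n in ORDER}


def counterfactual(observed, do):
    """Abduction, action, prediction: replay the recovered noise under do(...)."""
    return forward(inverse(observed), do=do)


if __name__ == "__main__":
    observed = forward({"X": 1.0, "T": 0.2, "Y": -0.4})
    print("observed:", observed)
    print("had T been 0:", counterfactual(observed, {"T": 0.0}))
```

Because the inverse is available in closed form, the abduction step involves no approximate posterior over the noise, which is the contrast with variational inference drawn in the abstract. The DAG enters through `PARENTS`: each mechanism sees only its declared parents, so an intervention on `T` changes `Y` but leaves `X` untouched.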

Olivia Jullian Parra, Sara Zoccheddu, David Catalan Cerezo, Tom Forzy, Franziska Ulrich, William Sutcliffe, Jakob Martin Burgstaller, Oliver Senn, Patrick Owen, Nicola Serra • 2026

Related benchmarks

Task	Dataset	Result
Biomarker-level imputation	Semi-synthetic biomarkers 30% missingness	MAE 1.914	7
Biomarker-level imputation	Semi-synthetic biomarkers 50% missingness	MAE 2.052	7
Biomarker-level imputation	Semi-synthetic biomarkers 80% missingness	MAE 2.059	7
Imputation	Semi-synthetic EHR dataset MNAR 30% (test)	MAE 1	6
Imputation	Semi-synthetic EHR dataset MNAR 50% (test)	MAE 1	6
Imputation	Semi-synthetic EHR dataset MNAR 80% (test)	MAE 1	6
Imputation	Semi-synthetic EHR dataset Pooled 30-80% MNAR (summary)	Mean Rank 2.38	6
Causal Estimation	CVD Risk synthetic dataset	MAE (rank) 1	5
Reliability Assessment	Aggregate Synthetic Datasets	Mean Rank 1.83	5
Causal Estimation	LDL synthetic dataset	MAE (rank) 2	5

Showing 10 of 12 rows
