Generative Modeling under Non-Monotonic MAR Missingness via Approximate Wasserstein Gradient Flows

About

The prevalence of missing values in data science poses a substantial risk to any further analyses. Despite a wealth of research, principled nonparametric methods to deal with general non-monotone missingness are still scarce. Instead, ad-hoc imputation methods are often used, for which it remains unclear whether the correct distribution can be recovered. In this paper, we propose FLOWGEM, a principled iterative method for generating a complete dataset from a dataset with values Missing at Random (MAR). Motivated by convergence results of the ignoring maximum likelihood estimator, our approach minimizes the expected Kullback-Leibler (KL) divergence between the observed data distribution and the distribution of the generated sample over different missingness patterns. To minimize the KL divergence, we employ a discretized particle evolution of the corresponding Wasserstein Gradient Flow, where the velocity field is approximated using a local linear estimator of the density ratio. This construction yields a data generation scheme that iteratively transports an initial particle ensemble toward the target distribution. Simulation studies and real-data benchmarks demonstrate that FLOWGEM achieves state-of-the-art performance across a range of settings, including the challenging case of non-monotonic MAR mechanisms. Together, these results position FLOWGEM as a principled and practical alternative to existing imputation methods, and a decisive step towards closing the gap between theoretical rigor and empirical performance.

Gitte Kremling, Jeffrey Näf, Johannes Lederer • 2026
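The particle evolution mentioned in the abstract can be illustrated with a toy sketch. The Wasserstein gradient flow of KL(ρ ‖ π) moves particles along the velocity field ∇ log(π/ρ), and an explicit Euler discretization gives the update x ← x + η ∇ log(π/ρ)(x). The code below is not FLOWGEM: it assumes fully observed data (no missingness patterns), and it swaps the paper's local linear density-ratio estimator for a plain Gaussian kernel density estimate of each score. The names `kde_score` and `kl_flow`, the bandwidth `h`, and the step size are all hypothetical choices made for illustration.

```python
import numpy as np

def kde_score(x, samples, h):
    """Gradient of the log of a Gaussian KDE built on `samples`, evaluated at `x`.

    Assumption: a KDE stands in for the paper's local linear estimator.
    """
    diff = samples[None, :, :] - x[:, None, :]        # shape (n, m, d)
    logw = -np.sum(diff ** 2, axis=2) / (2 * h ** 2)  # log kernel weights, (n, m)
    logw -= logw.max(axis=1, keepdims=True)           # stabilize the softmax
    w = np.exp(logw)
    w /= w.sum(axis=1, keepdims=True)
    return np.einsum("nm,nmd->nd", w, diff) / h ** 2

def kl_flow(particles, target, steps=200, step_size=0.05, h=0.5):
    """Explicit Euler steps along an estimate of grad log(target / particles)."""
    x = particles.copy()
    for _ in range(steps):
        velocity = kde_score(x, target, h) - kde_score(x, x, h)
        x = x + step_size * velocity
    return x

rng = np.random.default_rng(0)
target = rng.normal(loc=3.0, scale=1.0, size=(300, 2))  # toy "observed" sample
init = rng.normal(loc=0.0, scale=1.0, size=(300, 2))    # initial particle ensemble
out = kl_flow(init, target)
```

In this toy setting the first score term pulls particles toward regions where the target sample is dense, while the second pushes them apart so that they spread over the target rather than collapsing onto its mode.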

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Sample Generation | Concrete | Standardized Energy Distance: 5.84 | 8 |
| Sample Generation | Forest | Standardized Energy Distance: 3.99 | 8 |
| Sample Generation | Housing | Standardized Energy Distance: 7.8 | 8 |
| Sample Generation | Stock | Standardized Energy Distance: 4.7 | 8 |
| Sample Generation | windspeed | Standardized Energy Distance: 2.03 | 8 |
| Tabular Synthetic Data Generation | Parkinsons | -- | 8 |
| Sample Generation | allergens | Standardized Energy Distance: 34.29 | 7 |
| Sample Generation | SCM20d | Standardized Energy Distance: 17.65 | 7 |
| Sample Generation | SCM1d | Standardized Energy Distance: 26.26 | 7 |
| Sample Generation | pumadyn32nm | Standardized Energy Distance: 11.41 | 7 |