Beyond Pooling: Matching for Robust Generalization under Data Heterogeneity

About

Pooling heterogeneous datasets across domains is a common strategy in representation learning, but naive pooling can amplify distributional asymmetries and yield biased estimators, especially in settings where zero-shot generalization is required. We propose a matching framework that selects samples relative to an adaptive centroid and iteratively refines the representation distribution. By combining double robustness with propensity score matching for the inclusion of data domains, matching filters out the confounding domains (the main cause of heterogeneity) and is therefore more robust than naive pooling and uniform subsampling. Theoretical and empirical analyses show that matching outperforms naive pooling and uniform subsampling under asymmetric meta-distributions, and that these results extend to non-Gaussian and multimodal real-world settings. Most importantly, we show that these improvements translate to zero-shot medical anomaly detection, one of the most extreme forms of data heterogeneity and asymmetry. The code is available at https://github.com/AyushRoy2001/Beyond-Pooling.
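
To make the idea of selecting samples relative to an adaptive centroid more concrete, here is a minimal sketch in Python. It is an illustration only, not the authors' implementation: the function name match_to_centroid, the keep_fraction parameter, the Euclidean distance, and the stopping rule are all assumptions, and the propensity score and double robustness components described in the abstract are not modeled here.

```python
import numpy as np

def match_to_centroid(features, keep_fraction=0.5, n_iters=10, tol=1e-6):
    """Hypothetical centroid-based matching (illustrative assumption only).

    Starts from the pooled mean, keeps the samples closest to the current
    centroid, recomputes the centroid from the kept samples, and repeats.
    Requires n_iters >= 1.
    """
    features = np.asarray(features, dtype=float)
    n_keep = max(1, int(round(keep_fraction * len(features))))
    centroid = features.mean(axis=0)  # naive pooled estimate as the start
    for _ in range(n_iters):
        # Euclidean distance of every sample to the current centroid.
        dists = np.linalg.norm(features - centroid, axis=1)
        keep = np.argsort(dists)[:n_keep]
        new_centroid = features[keep].mean(axis=0)
        shift = np.linalg.norm(new_centroid - centroid)
        centroid = new_centroid
        if shift < tol:  # centroid has stopped moving
            break
    return np.sort(keep), centroid
```

With a large domain and a small, distant domain pooled together, the pooled mean is pulled toward the distant domain, whereas the matched centroid in this sketch settles inside the large domain. How closely this mirrors the paper's procedure is not something the abstract alone can confirm; see the linked repository for the actual method.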

Ayush Roy, Rudrasis Chakraborty, Lav Varshney, Vishnu Suresh Lokhande • 2026

Related benchmarks

Task	Dataset	Result
Anomaly Segmentation	RESC	AUC 95.09	74
Anomaly Classification	LiverCT	AUC 80.26	72
Anomaly Classification	RESC	AUC (%) 89.42	68
Anomaly Classification	OCT 17	AUC 97.21	54
Anomaly Classification	BrainMRI	AUC 87.67	47
Anomaly Segmentation	LiverCT	AUC 98.75	45
Anomaly Classification	HIS	AUC 75.41	40
Anomaly Segmentation	BrainMRI	--	39
Anomaly Classification	ChestXray	AUC 74.94	26

