DISCO-TAB: A Hierarchical Reinforcement Learning Framework for Privacy-Preserving Synthesis of Complex Clinical Data
About
The development of robust clinical decision support systems is frequently impeded by the scarcity of high-fidelity, privacy-preserving biomedical data. While generative Large Language Models (LLMs) offer a promising avenue for synthetic data generation, they often struggle to capture the complex, non-linear dependencies and severe class imbalances inherent in Electronic Health Records (EHR), leading to statistically plausible but clinically invalid records. To bridge this gap, we introduce DISCO-TAB (DIScriminator-guided COntrol for TABular synthesis), a novel framework that orchestrates a fine-tuned LLM with a multi-objective discriminator system optimized via Reinforcement Learning. Unlike prior methods that rely on scalar feedback, DISCO-TAB evaluates synthesis at four granularities (token, sentence, feature, and row) and integrates Automated Constraint Discovery and Inverse-Frequency Reward Shaping to autonomously preserve latent medical logic and resolve minority-class collapse. We validate our framework across diverse benchmarks, including high-dimensional, small-sample medical datasets (e.g., Heart Failure, Parkinson's). Our results demonstrate that hierarchical feedback yields state-of-the-art performance, achieving up to 38.2% improvement in downstream clinical classifier utility compared to GAN and diffusion baselines, while maintaining high statistical fidelity (JSD < 0.01) and robust resistance to membership inference attacks. This work establishes a new standard for generating trustworthy, utility-preserving synthetic tabular data for sensitive healthcare applications.
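The abstract names Inverse-Frequency Reward Shaping but does not give its formula. As a rough illustration of the general idea only, the sketch below scales a reward by a per-class weight that grows as the class becomes rarer. The weighting rule `n / (k * count)`, the function names `inverse_frequency_weights` and `shaped_reward`, and the toy labels are all assumptions made for this example, not the authors' implementation.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per class, inversely proportional to its frequency.

    Assumed rule (not from the paper): w_c = n / (k * n_c), where n is the
    number of rows, k the number of classes, and n_c the count of class c.
    A perfectly balanced dataset gives every class a weight of 1.0.
    """
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def shaped_reward(base_reward, label, weights):
    """Scale a discriminator reward by the weight of the generated row's class."""
    return base_reward * weights[label]


# Toy imbalanced target column: 90 majority rows, 10 minority rows.
labels = ["survived"] * 90 + ["died"] * 10
weights = inverse_frequency_weights(labels)

# The same base reward counts for more when the generated row is minority-class.
minority_reward = shaped_reward(1.0, "died", weights)
majority_reward = shaped_reward(1.0, "survived", weights)
```

Under this assumed rule, a generator that only ever emits majority-class rows collects systematically smaller rewards, which is one plausible way to discourage minority-class collapse.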
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Classification | German Credit | F1 Score | 92.5 | 15 |
| Machine Learning | bank-marketing | F1 Score | 87.1 | 15 |
| Downstream ML Utility | Heart Failure | F1-score | 100 | 8 |
| Downstream ML Utility | Breast cancer | F1-score | 99.4 | 8 |
| Downstream ML Utility | liver-disorders | F1-score | 99 | 8 |
| Downstream ML Utility | Parkinsons | F1-score | 96.6 | 8 |
| Downstream ML Utility | Obesity | F1-score | 92.9 | 8 |
| Tabular Synthetic Data Generation | Heart Failure | KS Statistic | 0.022 | 8 |
| Tabular Synthetic Data Generation | Breast cancer | KS Statistic | 0.025 | 8 |
| Tabular Synthetic Data Generation | liver-disorders | KS Statistic | 0.017 | 8 |
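
The fidelity figures above (KS Statistic in the table, JSD in the abstract) compare a real column with its synthetic counterpart. The page does not say how either metric was computed, so the following is only a minimal standard-library sketch of the textbook definitions: the two-sample Kolmogorov-Smirnov statistic for a numeric column and the Jensen-Shannon divergence (base-2 logarithm, an assumption) for a categorical column. The function names and sample values are illustrative.

```python
import math
from bisect import bisect_right
from collections import Counter


def ks_statistic(real, synthetic):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(real), sorted(synthetic)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


def js_divergence(real, synthetic):
    """Jensen-Shannon divergence between two categorical samples (log base 2).

    With base 2 the value lies in [0, 1]: 0 for identical distributions,
    1 for distributions with no category in common.
    """
    p, q = Counter(real), Counter(synthetic)
    n_p, n_q = len(real), len(synthetic)
    total = 0.0
    for cat in set(p) | set(q):
        pi = p[cat] / n_p
        qi = q[cat] / n_q
        mi = 0.5 * (pi + qi)  # mixture distribution
        if pi > 0:
            total += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            total += 0.5 * qi * math.log2(qi / mi)
    return total


# Illustrative columns only; these are not values from the benchmarks.
real_age = [40, 52, 61, 67, 73]
synthetic_age = [41, 50, 60, 68, 75]
age_ks = ks_statistic(real_age, synthetic_age)

real_sex = ["F", "M", "M", "F"]
synthetic_sex = ["M", "F", "F", "M"]
sex_jsd = js_divergence(real_sex, synthetic_sex)
```

Lower is better for both metrics. A per-dataset score such as those in the table would typically aggregate per-column values, though the aggregation used here is not stated.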