UniMedVL: Unifying Medical Multimodal Understanding and Generation through Observation-Knowledge-Analysis
About
Medical workflows routinely combine reading images with producing visual and textual outputs, making both image understanding and generation central to medical AI. Most existing systems, however, address these abilities in isolated models, losing the shared knowledge that a unified architecture could exploit. To bridge this gap, we present UniMedVL, the first unified medical model that integrates multimodal understanding and generation within a single model, without switching weights. We achieve this via a tailored progressive training pipeline in which understanding and generation mutually reinforce each other. To train UniMedVL, we curate UniMedVL-5M, the first large-scale medical dataset comprising over 5.6M instances across 8 medical imaging modalities, tailored for multimodal input-output tasks in unified medical understanding and generation. Experimental results demonstrate that UniMedVL achieves competitive performance on five medical understanding benchmarks. Crucially, UniMedVL natively supports diverse interleaved generation tasks, such as virtual staining, super-resolution, and cross-modal synthesis, which are essential for complex medical workflows. Our code and dataset are publicly available.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Medical Visual Question Answering | VQA-RAD | Accuracy | 61.9 | 228 |
| Medical Image Segmentation | BUSI | Dice Score | 14.87 | 134 |
| Medical Image Synthesis | BraTS | SSIM | 81.89 | 108 |
| Medical Image Segmentation | GLAS | Dice | 52.86 | 106 |
| Medical Report Generation | MIMIC-CXR (test) | ROUGE-L | 0.2727 | 100 |
| Medical Visual Question Answering | PathVQA | Accuracy | 53.5 | 80 |
| Medical Image Segmentation | ISIC | DICE | 48.62 | 79 |
| Medical Image Segmentation | REFUGE | Dice Score | 0.5586 | 52 |
| Medical Image Segmentation | Kvasir | mDice | 35.55 | 49 |
| Medical Visual Question Answering | OmniMedVQA | Accuracy | 85.8 | 48 |