UniMedVL: Unifying Medical Multimodal Understanding and Generation through Observation-Knowledge-Analysis

About

Medical workflows routinely combine reading images with producing visual and textual outputs, making both image understanding and generation central to medical AI. Most existing systems, however, address these abilities in isolated models, losing the shared knowledge that a unified architecture could exploit. To bridge this gap, we present UniMedVL, the first unified medical model that seamlessly integrates multimodal understanding and generation capabilities within a single model without switching weights. We achieve this via a tailored progressive training pipeline where understanding and generation mutually reinforce each other. To effectively train UniMedVL, we curate UniMedVL-5M, the first large-scale medical dataset comprising over 5.6M instances across 8 medical imaging modalities, tailored for multimodal input-output tasks in unified medical understanding and generation. Experimental results demonstrate that UniMedVL achieves competitive performance on five medical understanding benchmarks. Crucially, UniMedVL natively supports diverse interleaved generation tasks, e.g., virtual staining, super-resolution, cross-modal synthesis, essential for complex medical workflows. Our code and dataset are publicly available.

Junzhi Ning, Wei Li, Cheng Tang, Jiashi Lin, Chenglong Ma, Chaoyang Zhang, Jiyao Liu, Ying Chen, Shujian Gao, Yuandong Pu, Huihui Xu, Chenhui Gou, Ziyan Huang, Yi Xin, Qi Qin, Diping Song, Bin Fu, Guang Yang, Yuanfeng Ji, Tianbin Li, Yanzhou Su, Jin Ye, Shixiang Tang, Zhongying Deng, Lihao Liu, Ming Hu, Junjun He• 2025

Related benchmarks

Task	Dataset	Result
Medical Visual Question Answering	VQA-RAD	Accuracy61.9	251
Medical Image Segmentation	BUSI	Dice Score14.87	143
Medical Image Segmentation	GLAS	Dice52.86	134
Medical Image Segmentation	ISIC	DICE48.62	114
Medical Image Synthesis	BraTS	SSIM81.89	108
Medical Visual Question Answering	PathVQA	Accuracy53.5	103
Medical Report Generation	MIMIC-CXR (test)	ROUGE-L0.2727	100
Medical Image Segmentation	Kvasir	mDice35.55	60
Medical Image Segmentation	REFUGE	Dice Score0.5586	52
Medical Visual Question Answering	OmniMedVQA	Accuracy85.8	48

Showing 10 of 89 rows

...

Other info

Follow for update

@wizwand_team Discord