
Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning

About

Medical imaging provides critical evidence for clinical diagnosis, treatment planning, and surgical decisions, yet most existing imaging models are narrowly focused and require multiple specialized networks, limiting their generalization. Although large-scale language and multimodal models exhibit strong reasoning and multi-task capabilities, real-world clinical applications demand precise visual grounding, multimodal integration, and chain-of-thought reasoning. We introduce Citrus-V, a multimodal medical foundation model that combines image analysis with textual reasoning. The model integrates detection, segmentation, and multimodal chain-of-thought reasoning, enabling pixel-level lesion localization, structured report generation, and physician-like diagnostic inference in a single framework. We propose a novel multimodal training approach and release a curated open-source data suite covering reasoning, detection, segmentation, and document understanding tasks. Evaluations demonstrate that Citrus-V outperforms existing open-source medical models and expert-level imaging systems across multiple benchmarks, delivering a unified pipeline from visual grounding to clinical reasoning and supporting precise lesion quantification, automated reporting, and reliable second opinions.

Guoxin Wang, Jun Zhao, Xinyi Liu, Yanbo Liu, Xuyang Cao, Chao Li, Zhuoyun Liu, Qintian Sun, Fangru Zhou, Haoqiang Xing, Zhenhong Yang • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Interactive Medical Image Segmentation | Fundus | Dice | 0.782 | 16
Interactive Medical Image Segmentation | X-Ray | Dice | 0.624 | 16
Interactive Medical Image Segmentation | Endoscopy | Dice | 66.6 | 16
Interactive Medical Image Segmentation | CT (Computed Tomography) | Dice | 36.3 | 16
Interactive Medical Image Segmentation | MRI (Magnetic Resonance Imaging) | Dice | 0.326 | 16
Interactive Medical Image Segmentation | Average across 6 medical imaging modalities | Dice | 48.2 | 16
Interactive Medical Image Segmentation | Ultrasound | Dice | 13.4 | 16
Diagnosis | TN5K | Accuracy | 74.7 | 14
Diagnosis | BUS-CoT | Accuracy | 46.67 | 14
Medical lesion detection | BUS-CoT (test) | Precision | 5.72 | 14
Showing 10 of 15 rows
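
The segmentation rows above report Dice scores, some as fractions (for example 0.782) and some as percentages (for example 66.6). The following is a minimal sketch, not the benchmark's evaluation code, of how a Dice score is commonly computed for binary masks and how the per-modality values relate to the average row. The function names `dice` and `to_percent`, the empty-mask convention, and the rule that values at or below 1 are fractions are all assumptions made for illustration.

```python
import numpy as np


def dice(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    total = pred.sum() + target.sum()
    if total == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * np.logical_and(pred, target).sum() / total


def to_percent(score):
    """Assumption: values <= 1 are fractions; larger values are already percentages."""
    return score * 100.0 if score <= 1.0 else score


# Per-modality Dice values copied from the table as listed (mixed scales).
per_modality = {
    "Fundus": 0.782,
    "X-Ray": 0.624,
    "Endoscopy": 66.6,
    "CT": 36.3,
    "MRI": 0.326,
    "Ultrasound": 13.4,
}

# Put every value on a 0-100 scale before averaging.
mean_dice = sum(to_percent(v) for v in per_modality.values()) / len(per_modality)
print(f"Mean Dice across {len(per_modality)} modalities: {mean_dice:.2f}")
```

Under these assumptions the six per-modality values average to about 48.25, which is in line with the 48.2 listed in the average row and suggests that the fractional and percentage entries describe the same metric on different scales.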
