
OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding

About

The practical deployment of medical vision-language models (Med-VLMs) necessitates seamless integration of textual data with diverse visual modalities, including 2D/3D images and videos, yet existing models typically employ separate encoders for different modalities. To address this limitation, we present OmniV-Med, a unified framework for multimodal medical understanding. Our technical contributions are threefold: First, we construct OmniV-Med-Instruct, a comprehensive multimodal medical dataset containing 252K instructional samples spanning 14 medical image modalities and 11 clinical tasks. Second, we devise a rotary position-adaptive encoder that processes multi-resolution 2D/3D images and videos within a unified architecture, diverging from conventional modality-specific encoders. Third, we introduce a medical-aware token pruning mechanism that exploits spatial-temporal redundancy in volumetric data (e.g., consecutive CT slices) and medical videos, effectively reducing 60% of visual tokens without performance degradation. Empirical evaluations demonstrate that OmniV-Med-7B achieves state-of-the-art performance on 7 benchmarks spanning 2D/3D medical imaging and video understanding tasks. Notably, our lightweight variant (OmniV-Med-1.5B) attains comparable performance while requiring only 8 RTX3090 GPUs for training and supporting efficient long-video inference. Data, code and model will be released.

Songtao Jiang, Yuan Wang, Sibo Song, Yan Zhang, Zijie Meng, Bohan Lei, Jian Wu, Jimeng Sun, Zuozhu Liu • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Image Observation | 3D-RAD | BLEU 16.42 | 9 |
| Anomaly Detection | 3D-RAD | BLEU 13.47 | 9 |
| Longitudinal Temporal Diagnosis | 3D-RAD | Accuracy 24.23 | 9 |
| Existence Detection | 3D-RAD | Accuracy 28.66 | 9 |
| Medical Measurement | 3D-RAD | BLEU 2.52 | 9 |
| Static Temporal Diagnosis | 3D-RAD | Accuracy 22.96 | 9 |
