
Fleming-VL: Towards Universal Medical Visual Reasoning with Multimodal LLMs

About

Multimodal Large Language Models (MLLMs) have demonstrated remarkable effectiveness in various general-domain scenarios, such as visual question answering and image captioning. Recently, researchers have increasingly focused on empowering MLLMs with medical conversational abilities, which hold significant promise for clinical applications. However, medical data presents unique challenges because it is heterogeneous, spanning diverse modalities including 2D images, 3D volumetric scans, and temporal video sequences. The substantial domain gap and data format inconsistencies across these modalities have hindered the development of unified medical MLLMs. To address these challenges, we propose Fleming-VL, a unified end-to-end framework for comprehensive medical visual understanding across heterogeneous modalities. Fleming-VL tackles this problem from a data-centric perspective through three key strategies: (1) scaling up pretraining by integrating long-context data from both natural and medical-specific domains; (2) complementing fine-tuning with rare medical data, including holistic video analysis and underrepresented 2D modalities such as ultrasound and dermoscopy images; (3) extending existing evaluation frameworks to incorporate 3D volumetric and video understanding benchmarks. Through supervised fine-tuning (SFT) and group relative policy optimization (GRPO), we develop Fleming-VL at multiple model scales. Extensive experiments demonstrate that Fleming-VL achieves state-of-the-art performance across multiple benchmarks, including medical VQA, video QA, and 3D medical image understanding. We publicly release Fleming-VL to promote transparent, reproducible, and auditable progress in medical AI.

Yan Shu, Chi Liu, Robin Chen, Derek Li, Bryan Dai • 2025

Related benchmarks

Task | Dataset | Result | Rank
Radiology Report Generation | CHEXPERT Plus | R-L 26.1 | 22
Medical Visual Question Answering | Medical VQA Suite (MMMU-Med, VQA-RAD, SLAKE, PathVQA, PMC-VQA, OmniMedVQA, MedXpertQA) | MMMU-Med Score 63.3 | 18
Medical Report Generation | IU-Xray | ROUGE-L 44.9 | 17
Medical Question Answering | Medical Text QA Suite (MMLU-Med, PubMedQA, MedMCQA, MedQA, Medbullets, MedXpertQA, SGPQA) | MMLU-Med 71.8 | 17
Medical Report Generation | Med-Trinity | ROUGE-L 13.1 | 8
Multi-view Grounding | MedSG | IoU 42 | 6
Object Tracking | MedSG | IoU 36.7 | 6
Referring Expression Grounding | MedSG | IoU (%) 16.6 | 6
Bacteria Detection | Bacteria | IoU 830 | 5
Lesion Detection | DeepLesion | IoU (%) 0.00e+0 | 5
Showing 10 of 11 rows
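
Several rows above report IoU (intersection over union) for grounding and detection tasks. As a reminder of what that metric measures, here is a minimal sketch for two axis-aligned boxes. The `(x1, y1, x2, y2)` box format and the function name `box_iou` are assumptions made for illustration; the exact evaluation protocol, averaging, and scaling used by each benchmark are not stated on this page.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes, each given as (x1, y1, x2, y2).

    The box format is an assumption for this sketch, not the
    benchmarks' documented convention.
    """
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])

    # Clamp at zero so disjoint boxes give zero intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Degenerate (zero-area) inputs yield 0.0 rather than dividing by zero.
    return inter / union if union > 0 else 0.0
```

By this definition IoU lies in [0, 1], or 0 to 100 when expressed as a percentage.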
