Fleming-VL: Towards Universal Medical Visual Reasoning with Multimodal LLMs

About

Multimodal Large Language Models (MLLMs) have demonstrated remarkable effectiveness in various general-domain scenarios, such as visual question answering and image captioning. Recently, researchers have increasingly focused on empowering MLLMs with medical conversational abilities, which hold significant promise for clinical applications. However, medical data presents unique challenges due to its heterogeneous nature -- encompassing diverse modalities including 2D images, 3D volumetric scans, and temporal video sequences. The substantial domain gap and data format inconsistencies across these modalities have hindered the development of unified medical MLLMs. To address these challenges, we propose Fleming-VL, a unified end-to-end framework for comprehensive medical visual understanding across heterogeneous modalities. Fleming-VL tackles this problem from a data-centric perspective through three key strategies: (1) scaling up pretraining by integrating long-context data from both natural and medical-specific domains; (2) complementing fine-tuning with rare medical data, including holistic video analysis and underrepresented 2D modalities such as ultrasound and dermoscopy images; (3) extending existing evaluation frameworks to incorporate 3D volumetric and video understanding benchmarks. Through supervised fine-tuning (SFT) and group relative policy optimization (GRPO), we develop Fleming-VL in multiple model scales. Extensive experiments demonstrate that Fleming-VL achieves state-of-the-art performance across multiple benchmarks, including medical VQA, video QA, and 3D medical image understanding. We publicly release Fleming-VL to promote transparent, reproducible, and auditable progress in medical AI.

Yan Shu, Chi Liu, Robin Chen, Derek Li, Bryan Dai • 2025

Related benchmarks

Task	Dataset	Result
Medical Visual Question Answering	VQA-RAD	--	251
Medical Visual Question Answering	PathVQA	--	109
Medical Visual Question Answering	PMC-VQA	Accuracy: 61.3	103
Radiology Report Generation	CHEXPERT Plus	R-L: 26.1	37
Medical Visual Question Answering	Medical VQA Suite (MMMU-Med, VQA-RAD, SLAKE, PathVQA, PMC-VQA, OmniMedVQA, MedXpertQA)	MMMU-Med Score: 63.3	18
Medical Report Generation	IU-Xray	ROUGE-L: 44.9	17
Medical Question Answering	Medical Text QA Suite (MMLU-Med, PubMedQA, MedMCQA, MedQA, Medbullets, MedXpertQA, SGPQA)	MMLU-Med: 71.8	17
Medical VQA	MSD In-Domain (test)	Accuracy: 65.5	16
Medical VQA	BiomedParse In-Domain (test)	Accuracy: 61.49	16
Medical VQA	In-house (Held-out)	Accuracy: 73.6	16

Showing 10 of 22 rows
