MedVLThinker: Simple Baselines for Multimodal Medical Reasoning

About

Large Reasoning Models (LRMs) have introduced a new paradigm in AI by enabling models to "think before responding" via chain-of-thought reasoning. However, the absence of open and reproducible recipes for building reasoning-centric medical LMMs hinders community-wide research, analysis, and comparison. In this paper, we present MedVLThinker, a suite of simple yet strong baselines. Our fully open recipe consists of: (1) systematic data curation for both text-only and image-text medical data, filtered according to varying levels of reasoning difficulty, and (2) two training paradigms: Supervised Fine-Tuning (SFT) on distilled reasoning traces and Reinforcement Learning with Verifiable Rewards (RLVR) based on final answer correctness. Across extensive experiments on the Qwen2.5-VL model family (3B, 7B) and six medical QA benchmarks, we find that RLVR consistently and significantly outperforms SFT. Additionally, under the RLVR framework, a key, counter-intuitive finding is that training on our curated text-only reasoning data provides a more substantial performance boost than training on multimodal image-text data. Our best open 7B model, trained using the RLVR recipe on text-only data, establishes a new state-of-the-art on existing public VQA benchmarks, surpassing all previous open-source medical LMMs. Furthermore, scaling our model to 32B achieves performance on par with the proprietary GPT-4o. We release all curated data, models, and code to provide the community with a strong, open foundation for future research in multimodal medical reasoning.

Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, Yuyin Zhou • 2025
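
The abstract describes the RLVR signal only as "final answer correctness". As a rough illustration of what such a verifiable reward could look like for multiple-choice medical QA, here is a minimal sketch. It is not the authors' implementation: the `<answer>...</answer>` output format, the single-letter options, and the function names `extract_choice` and `correctness_reward` are all assumptions made for this example.

```python
import re
from typing import Optional

# Assumption: the model is prompted to wrap its final choice in
# <answer>...</answer> tags, e.g. "<answer>B</answer>" or "<answer>(b)</answer>".
ANSWER_PATTERN = re.compile(r"<answer>\s*\(?([A-Za-z])\)?\s*</answer>", re.DOTALL)


def extract_choice(response: str) -> Optional[str]:
    """Return the last tagged option letter (uppercased), or None if absent."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].upper() if matches else None


def correctness_reward(response: str, gold_choice: str) -> float:
    """Binary reward: 1.0 if the tagged final answer equals the gold letter."""
    predicted = extract_choice(response)
    if predicted is None:
        # No parseable final answer is treated as incorrect in this sketch.
        return 0.0
    return 1.0 if predicted == gold_choice.strip().upper() else 0.0
```

Because the reward depends only on the extracted final answer, the reasoning text before it is never scored directly; a response with no parseable answer simply receives zero under this sketch.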

Related benchmarks

Task | Dataset | Metric | Result | Rank
Multimodal Understanding | MMMU | Accuracy | 56.86 | 275
Medical Visual Question Answering | Slake | Accuracy | 73.96 | 134
Medical Visual Question Answering | VQA-RAD | Accuracy | 76.96 | 106
Medical Visual Question Answering | PathVQA | Overall Accuracy | 68.82 | 86
Visual Question Answering | SlideBench-VQA TCGA | Microscopy Score | 48.43 | 32
Multi-modal Question Answering | MedXpertQA-MM | Accuracy | 34.6 | 27
Visual Question Answering | WSI-VQA | Overall Accuracy | 49.13 | 25
Visual Question Answering | SlideBench-VQA BCNB | Overall | 36.42 | 25
Visual Question Answering | PathMMU Tiny 1.0 (test) | Overall Accuracy | 46.01 | 24
Visual Question Answering | PathMMU 1.0 (ALL test) | Overall Score | 44.23 | 22
Showing 10 of 19 rows
