MedVLThinker: Simple Baselines for Multimodal Medical Reasoning
About
Large Reasoning Models (LRMs) have introduced a new paradigm in AI by enabling models to "think before responding" via chain-of-thought reasoning. However, the absence of open and reproducible recipes for building reasoning-centric medical LMMs hinders community-wide research, analysis, and comparison. In this paper, we present MedVLThinker, a suite of simple yet strong baselines. Our fully open recipe consists of: (1) systematic data curation for both text-only and image-text medical data, filtered according to varying levels of reasoning difficulty, and (2) two training paradigms: Supervised Fine-Tuning (SFT) on distilled reasoning traces and Reinforcement Learning with Verifiable Rewards (RLVR) based on final answer correctness. Across extensive experiments on the Qwen2.5-VL model family (3B, 7B) and six medical QA benchmarks, we find that RLVR consistently and significantly outperforms SFT. Additionally, under the RLVR framework, a key, counter-intuitive finding is that training on our curated text-only reasoning data provides a more substantial performance boost than training on multimodal image-text data. Our best open 7B model, trained using the RLVR recipe on text-only data, establishes a new state-of-the-art on existing public VQA benchmarks, surpassing all previous open-source medical LMMs. Furthermore, scaling our model to 32B achieves performance on par with the proprietary GPT-4o. We release all curated data, models, and code to provide the community with a strong, open foundation for future research in multimodal medical reasoning.
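To make "verifiable rewards based on final answer correctness" concrete, here is a minimal sketch of a binary reward for multiple-choice questions. It is an illustration under assumptions, not the released implementation: the `<answer>...</answer>` tag format and the names `extract_choice` and `correctness_reward` are hypothetical and do not come from the abstract.

```python
import re
from typing import Optional

# Assumption: the model wraps its final option letter in <answer>...</answer>.
# The abstract does not specify the output format; this is for illustration only.
ANSWER_PATTERN = re.compile(
    r"<answer>\s*\(?([A-Za-z])\b\)?[^<]*</answer>", re.DOTALL
)


def extract_choice(response: str) -> Optional[str]:
    """Return the last tagged option letter (uppercased), or None if absent."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].upper() if matches else None


def correctness_reward(response: str, gold_choice: str) -> float:
    """Binary reward: 1.0 when the extracted letter equals the gold letter."""
    predicted = extract_choice(response)
    return 1.0 if predicted == gold_choice.strip().upper() else 0.0
```

Because the reward depends only on the final answer, the reasoning text before the tag is never scored directly; a response with no parsable answer simply receives zero.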
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Multimodal Understanding | MMMU | Accuracy | 56.86 | 275 |
| Medical Visual Question Answering | Slake | Accuracy | 73.96 | 134 |
| Medical Visual Question Answering | VQA-RAD | Accuracy | 76.96 | 106 |
| Medical Visual Question Answering | PathVQA | Overall Accuracy | 68.82 | 86 |
| Visual Question Answering | SlideBench-VQA TCGA | Microscopy Score | 48.43 | 32 |
| Multi-modal Question Answering | MedXpertQA-MM | Accuracy | 34.6 | 27 |
| Visual Question Answering | WSI-VQA | Overall Accuracy | 49.13 | 25 |
| Visual Question Answering | SlideBench-VQA BCNB | Overall | 36.42 | 25 |
| Visual Question Answering | PathMMU Tiny 1.0 (test) | Overall Accuracy | 46.01 | 24 |
| Visual Question Answering | PathMMU 1.0 (ALL test) | Overall Score | 44.23 | 22 |