Our new X account is live! Follow @wizwand_team for updates
WorkDL logo mark

MEDVISTAGYM: A Scalable Training Environment for Thinking with Medical Images via Tool-Integrated Reinforcement Learning

About

Vision language models (VLMs) achieve strong performance on general image understanding but struggle to think with medical images, especially when performing multi-step reasoning through iterative visual interaction. Medical VLMs often rely on static visual embeddings and single-pass inference, preventing models from re-examining, verifying, or refining visual evidence during reasoning. While tool-integrated reasoning offers a promising path forward, open-source VLMs lack the training infrastructure to learn effective tool selection, invocation, and coordination in multi-modal medical reasoning. We introduce MedVistaGym, a scalable and interactive training environment that incentivizes tool-integrated visual reasoning for medical image analysis. MedVistaGym equips VLMs to determine when and which tools to invoke, localize task-relevant image regions, and integrate single or multiple sub-image evidence into interleaved multimodal reasoning within a unified, executable interface for agentic training. Using MedVistaGym, we train MedVistaGym-R1 to interleave tool use with agentic reasoning through trajectory sampling and end-to-end reinforcement learning. Across six medical VQA benchmarks, MedVistaGym-R1-8B exceeds comparably sized tool-augmented baselines by 19.10% to 24.21%, demonstrating that structured agentic training--not tool access alone--unlocks effective tool-integrated reasoning for medical image analysis.

Meng Lu, Yuxing Lu, Yuchen Zhuang, Megan Mullins, Yang Xie, Guanghua Xiao, Charles Fleming, Wenqi Shi, Xuan Wang• 2026

Related benchmarks

TaskDatasetResultRank
Medical Visual Question AnsweringSlake
Accuracy81.36
134
Medical Visual Question AnsweringVQA-RAD
Accuracy70.75
106
Medical Visual Question AnsweringPathVQA
Overall Accuracy69
86
Medical Visual Question AnsweringPMC-VQA
Accuracy58
44
Medical Visual Question AnsweringMMMU H&M
Accuracy0.5643
25
Medical Visual Question AnsweringMicroVQA
Overall Accuracy43
25
Showing 6 of 6 rows

Other info

Follow for update