
V2T-CoT: From Vision to Text Chain-of-Thought for Medical Reasoning and Diagnosis

About

Recent advances in multimodal techniques have led to significant progress in Medical Visual Question Answering (Med-VQA). However, most existing models focus on global image features rather than localizing the disease-specific regions that are crucial for diagnosis. Additionally, current research tends to emphasize answer accuracy at the expense of the reasoning pathway, yet both are crucial for clinical decision-making. To address these challenges, we propose From Vision to Text Chain-of-Thought (V2T-CoT), a novel approach that automates the localization of preference areas within biomedical images and incorporates this localization into region-level pixel attention as knowledge for Vision CoT. By fine-tuning the vision language model on the constructed R-Med 39K dataset, V2T-CoT provides definitive medical reasoning paths. V2T-CoT integrates visual grounding with textual rationale generation to establish precise and explainable diagnostic results. Experimental results across four Med-VQA benchmarks demonstrate state-of-the-art performance, with substantial improvements in both accuracy and interpretability.
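The core idea of region-level pixel attention is to re-weight image features so that pixels inside the localized disease region contribute more to downstream reasoning. A minimal NumPy sketch of this weighting is below; the function name, the `boost` hyperparameter, and the normalization scheme are illustrative assumptions, not the paper's exact mechanism.

```python
import numpy as np

def region_pixel_attention(feat_map, region_mask, boost=2.0):
    """Re-weight a feature map toward a localized region.

    feat_map:    (H, W, C) image features
    region_mask: (H, W) binary mask of the localized disease region
    boost:       extra weight for in-region pixels (assumed hyperparameter)
    """
    # Per-pixel weight: 1.0 outside the region, `boost` inside it.
    weights = 1.0 + (boost - 1.0) * region_mask
    # Normalize so the total attention mass is unchanged on average.
    weights = weights / weights.mean()
    # Broadcast the (H, W) weights over the channel dimension.
    return feat_map * weights[..., None]

# Toy example: 4x4 feature map, localized region in the top-left 2x2 block.
feat = np.ones((4, 4, 8))
mask = np.zeros((4, 4))
mask[:2, :2] = 1.0
out = region_pixel_attention(feat, mask)
```

In this toy run, in-region features end up weighted higher than out-of-region ones while the overall feature magnitude stays comparable.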

Yuan Wang, Jiaxiang Liu, Shujian Gao, Bin Feng, Zhihang Tang, Xiaotang Gai, Jian Wu, Zuozhu Liu • 2025

Related benchmarks

Task                                Dataset                Result            Rank
Medical Visual Question Answering   SLAKE (closed-end)     Accuracy: 87.61   54
Medical Visual Question Answering   VQA-RAD (closed-end)   Accuracy: 84.86   45
Medical Visual Question Answering   PathVQA (closed-end)   Accuracy: 91.42   35

Other info

Code
