
V2T-CoT: From Vision to Text Chain-of-Thought for Medical Reasoning and Diagnosis

About

Recent advances in multimodal techniques have led to significant progress in Medical Visual Question Answering (Med-VQA). However, most existing models focus on global image features rather than localizing the disease-specific regions that are crucial for diagnosis. Additionally, current research tends to emphasize answer accuracy at the expense of the reasoning pathway, yet both are crucial for clinical decision-making. To address these challenges, we propose From Vision to Text Chain-of-Thought (V2T-CoT), a novel approach that automates the localization of preference areas within biomedical images and incorporates this localization into region-level pixel attention as knowledge for Vision CoT. By fine-tuning the vision language model on the constructed R-Med 39K dataset, V2T-CoT provides definitive medical reasoning paths. V2T-CoT integrates visual grounding with textual rationale generation to establish precise and explainable diagnostic results. Experimental results across four Med-VQA benchmarks demonstrate state-of-the-art performance, with substantial improvements in both accuracy and interpretability.
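The core idea of region-level pixel attention is to re-weight image features so that pixels inside the localized disease region contribute more to downstream reasoning. A minimal NumPy sketch of this weighting is below; the function name, the `boost` hyperparameter, and the normalization scheme are illustrative assumptions, not the paper's exact mechanism.

```python
import numpy as np

def region_pixel_attention(feat_map, region_mask, boost=2.0):
    """Re-weight a feature map toward a localized region.

    feat_map:    (H, W, C) image features
    region_mask: (H, W) binary mask of the localized disease region
    boost:       extra weight for in-region pixels (assumed hyperparameter)
    """
    # Per-pixel weight: 1.0 outside the region, `boost` inside it.
    weights = 1.0 + (boost - 1.0) * region_mask
    # Normalize so the total attention mass is unchanged on average.
    weights = weights / weights.mean()
    # Broadcast the (H, W) weights over the channel dimension.
    return feat_map * weights[..., None]

# Toy example: 4x4 feature map, localized region in the top-left 2x2 block.
feat = np.ones((4, 4, 8))
mask = np.zeros((4, 4))
mask[:2, :2] = 1.0
out = region_pixel_attention(feat, mask)
```

In this toy run, in-region features end up weighted higher than out-of-region ones while the overall feature magnitude stays comparable.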

Yuan Wang, Jiaxiang Liu, Shujian Gao, Bin Feng, Zhihang Tang, Xiaotang Gai, Jian Wu, Zuozhu Liu • 2025

Related benchmarks

Task                                Dataset                Result            Rank
Medical Visual Question Answering   SLAKE (closed-end)     Accuracy: 87.61   54
Medical Visual Question Answering   VQA-RAD (closed-end)   Accuracy: 84.86   45
Medical Visual Question Answering   PathVQA (closed-end)   Accuracy: 91.42   35

Other info

Code
