
VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use

About

Reinforcement Learning Finetuning (RFT) has significantly advanced the reasoning capabilities of large language models (LLMs) by enabling long chains of thought, self-correction, and effective tool use. While recent works attempt to extend RFT to vision-language models (VLMs), these efforts largely produce text-only reasoning conditioned on static image inputs, falling short of true multimodal reasoning in the response. In contrast, test-time methods like Visual Sketchpad incorporate visual steps but lack training mechanisms. We introduce VTool-R1, the first framework that trains VLMs to generate multimodal chains of thought by interleaving text and intermediate visual reasoning steps. VTool-R1 integrates Python-based visual editing tools into the RFT process, enabling VLMs to learn when and how to generate visual reasoning steps that benefit final reasoning. Trained with outcome-based rewards tied to task accuracy, our approach elicits strategic visual tool use for reasoning without relying on process-based supervision. Experiments on structured visual question answering over charts and tables show that VTool-R1 enhances reasoning performance by teaching VLMs to "think with images" and generate multimodal chains of thought with tools. To support future research in multi-turn multimodal reasoning, we open-source our code at https://github.com/VTOOL-R1/vtool-r1

Mingyuan Wu, Jingcheng Yang, Jize Jiang, Meitang Li, Kaizhuo Yan, Hanchao Yu, Minjia Zhang, Chengxiang Zhai, Klara Nahrstedt (2025)
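
The abstract says training uses outcome-based rewards tied to task accuracy rather than process-based supervision. As a rough illustration of that idea only, here is a minimal sketch that scores a rollout purely on its final answer. The exact-match criterion, the normalization step, and the names `normalize` and `outcome_reward` are assumptions made for this example; they are not taken from the paper or its repository.

```python
import re


def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse whitespace (an assumed normalization)."""
    return re.sub(r"\s+", " ", answer.strip().lower())


def outcome_reward(predicted: str, reference: str) -> float:
    """Hypothetical outcome-based reward.

    Returns 1.0 when the final answer matches the reference after
    normalization and 0.0 otherwise. Intermediate reasoning steps,
    including any tool calls, are not scored at all.
    """
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0
```

Because only the final answer is scored, a policy trained against a reward of this shape receives no direct credit for using a visual tool; tool use is reinforced only when it leads to more correct answers.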

Related benchmarks

Task | Dataset | Result | Rank
Optical Character Recognition | OCRBench | -- | 433
High-Resolution Visual Perception | HR-Bench-4K | Accuracy 68.5 | 79
Counting | TallyQA | Accuracy 79.4 | 67
High-Resolution Visual Perception | HR-Bench-8K | Accuracy 66.4 | 63
Document Visual Question Answering | DocVQA v1.0 (test) | -- | 49
Multimodal Mathematical Reasoning | MathVision | Pass@1 Accuracy 29.3 | 31
Visual Reasoning | VLMs are Blind | Accuracy 48.4 | 28
Medical Visual Question Answering | Slake | Accuracy 60.7 | 25
Visual Programming and Reasoning | HumanEval_V | Accuracy 4 | 22
Multimodal Reasoning | ERQA | Accuracy 36.8 | 22
