Visual Reasoning through Tool-supervised Reinforcement Learning

About

This paper investigates how Multimodal Large Language Models can effectively master tool use to solve complex visual reasoning tasks. To that end, we propose Tool-supervised Reinforcement Learning (ToolsRL), a framework that uses direct tool supervision for more effective tool-use learning. We focus on a set of simple, native, and interpretable visual tools, including zoom-in, rotate, flip, and draw point/line, for which tool supervision is easy to collect. We develop a two-stage reinforcement learning curriculum: the first stage is optimized solely with a set of well-motivated tool-specific rewards, and the second stage is trained with accuracy-targeted rewards while still allowing tool calls. In this way, the model masters tool calling before it uses tools to complete visual reasoning tasks, which avoids potential optimization conflicts among these heterogeneous tasks. Our experiments show that the tool-supervised curriculum training is efficient and that ToolsRL achieves strong tool-use capabilities on complex visual reasoning tasks.
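
The two-stage curriculum described above can be illustrated with a minimal sketch. The abstract does not specify the reward functions, so everything here is an assumption made for illustration: the `Rollout` and `Example` containers, the use of box overlap (IoU) as a stage-1 reward for the zoom-in tool, and exact-match accuracy as the stage-2 reward are hypothetical stand-ins, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); hypothetical box format


@dataclass
class Rollout:
    """One sampled trajectory; every field is a hypothetical stand-in."""
    zoom_box: Optional[Box]  # region the policy asked to zoom into, if any
    answer: str              # final answer text produced by the policy


@dataclass
class Example:
    """One training example; target_box plays the role of tool supervision."""
    target_box: Box
    answer: str


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def tool_reward(rollout: Rollout, example: Example) -> float:
    """Assumed stage-1 reward: overlap of the requested and supervised zoom regions."""
    if rollout.zoom_box is None:
        return 0.0
    return iou(rollout.zoom_box, example.target_box)


def accuracy_reward(rollout: Rollout, example: Example) -> float:
    """Assumed stage-2 reward: 1.0 on a case-insensitive exact match, else 0.0."""
    return float(rollout.answer.strip().lower() == example.answer.strip().lower())


def curriculum_reward(stage: int, rollout: Rollout, example: Example) -> float:
    """Stage 1 scores only tool use; stage 2 scores only the final answer."""
    if stage == 1:
        return tool_reward(rollout, example)
    if stage == 2:
        return accuracy_reward(rollout, example)
    raise ValueError(f"unknown curriculum stage: {stage}")
```

In this sketch, the training loop would call `curriculum_reward(1, ...)` until tool calling is reliable and then switch to `curriculum_reward(2, ...)`, so the two reward signals are never optimized at the same time.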

Qihua Dong, Gozde Sahin, Pei Wang, Zhaowei Cai, Robik Shrestha, Hao Yang, Davide Modolo • 2026

Related benchmarks

Task	Dataset	Result
Spatial Reasoning	V-Star	Average Score 92.5	7
Spatial Reasoning	HR-Bench-4K	HR-4K Average Score 75.9	7
Multi-step Visual Reasoning	Multi-step Visual Reasoning Suite (DocVQA-RF, TableVQA, VisualProbe, ChartQA-Pro) (train)	Average Tool Calls 3.4	7
Spatial Reasoning	HR-Bench-8K	HR-8K Average Accuracy 73.2	6
Chart/Table Understanding	CharXiv	Accuracy 43.5	5
Spatial Reasoning	VisualProbe	Accuracy 46.5	5
Chart/Table Understanding	ChartQA Pro	Accuracy 38.8	4
Chart/Table Understanding	TableVQA	Accuracy 70.2	4
Document Understanding	DocVQA-RF	ANLS 77.3	4
Document Understanding	InfoVQA-RF	ANLS 61.4	4

Showing 10 of 11 rows
