Visual Agentic Reinforcement Fine-Tuning
About
A key trend in Large Reasoning Models (e.g., OpenAI's o3) is the native agentic ability to use external tools such as web browsers for searching and writing/executing code for image manipulation to think with images. In the open-source research community, while significant progress has been made in language-only agentic abilities such as function calling and tool integration, the development of multi-modal agentic capabilities that involve truly thinking with images, and their corresponding benchmarks, are still less explored. This work highlights the effectiveness of Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT) for enabling flexible and adaptive reasoning abilities for Large Vision-Language Models (LVLMs). With Visual-ARFT, open-source LVLMs gain the ability to browse websites for real-time information updates and write code to manipulate and analyze input images through cropping, rotation, and other image processing techniques. We also present a Multi-modal Agentic Tool Bench (MAT) with two settings (MAT-Search and MAT-Coding) designed to evaluate LVLMs' agentic search and coding abilities. Our experimental results demonstrate that Visual-ARFT outperforms its baseline by +18.6% F1 / +13.0% EM on MAT-Coding and +10.3% F1 / +8.7% EM on MAT-Search, ultimately surpassing GPT-4o. Visual-ARFT also achieves +29.3 F1% / +25.9% EM gains on existing multi-hop QA benchmarks such as 2Wiki and HotpotQA, demonstrating strong generalization capabilities. Our findings suggest that Visual-ARFT offers a promising path toward building robust and generalizable multimodal agents.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Optical Character Recognition | OCRBench | -- | 433 | |
| Visual Question Answering | SimpleVQA | Accuracy0.424 | 164 | |
| Visual Question Answering | LiveVQA | Accuracy25.4 | 116 | |
| Multimodal Search | MMSearch | Accuracy34.5 | 85 | |
| High-Resolution Visual Perception | HR-Bench-4K | Accuracy58.9 | 79 | |
| Counting | TallyQA | Accuracy70.8 | 67 | |
| High-Resolution Visual Perception | HR-Bench-8K | Accuracy54 | 63 | |
| General Visual Question Answering | SimpleVQA | Pass@142.4 | 33 | |
| Multimodal Mathematical Reasoning | MathVision | Pass@1 Accuracy21.4 | 31 | |
| Search-oriented Visual Question Answering | Search-oriented Benchmarks 1.0 (test val) | InfoSeek Score37.95 | 28 |