DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

About

Large Vision-Language Models (VLMs) have shown strong capabilities in multimodal understanding and reasoning, yet they are primarily constrained by text-based reasoning processes. However, achieving seamless integration of visual and textual reasoning which mirrors human cognitive processes remains a significant challenge. In particular, effectively incorporating advanced visual input processing into reasoning mechanisms is still an open question. Thus, in this paper, we explore the interleaved multimodal reasoning paradigm and introduce DeepEyes, a model with "thinking with images" capabilities incentivized through end-to-end reinforcement learning without the need for cold-start SFT. Notably, this ability emerges natively within the model itself, leveraging its inherent grounding ability as a tool instead of depending on separate specialized models. Specifically, we propose a tool-use-oriented data selection mechanism and a reward strategy to encourage successful tool-assisted reasoning trajectories. DeepEyes achieves significant performance gains on fine-grained perception and reasoning benchmarks and also demonstrates improvement in grounding, hallucination, and mathematical reasoning tasks. Interestingly, we observe the distinct evolution of tool-calling behavior from initial exploration to efficient and accurate exploitation, and diverse thinking patterns that closely mirror human visual reasoning processes. Code is available at https://github.com/Visual-Agent/DeepEyes.

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, Xing Yu • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy: 87.7 | 935 |
| Mathematical Reasoning | MathVista | Score: 70.8 | 322 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score: 60.28 | 281 |
| Mathematical Reasoning | AIME 2024 | Accuracy: 6.98 | 251 |
| Mathematical Reasoning | AIME 2025 | Accuracy: 2.34 | 227 |
| Multimodal Understanding | MMStar | -- | 197 |
| Visual Mathematical Reasoning | MathVista | Accuracy: 70.1 | 189 |
| Multimodal Understanding | MME | MME Score: 64 | 158 |
| Mathematical Reasoning | AMC | Accuracy: 18.07 | 151 |
| Medical Visual Question Answering | Slake | Accuracy: 68.2 | 134 |
Showing 10 of 126 rows
