
Thinking with Images via Self-Calling Agent

About

Thinking-with-images paradigms have showcased remarkable visual reasoning capability by integrating visual information as dynamic elements into the Chain-of-Thought (CoT). However, optimizing interleaved multimodal CoT (iMCoT) through reinforcement learning remains challenging, as it relies on scarce high-quality reasoning data. In this study, we propose Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Specifically, a main agent decomposes a complex visual reasoning task into atomic subtasks and invokes its virtual replicas, i.e., parameter-sharing subagents, to solve them in isolated contexts. Because sCoT requires no explicit interleaving between modalities, it achieves substantial gains in training effectiveness and efficiency. To further enhance optimization, sCoT employs group-relative policy optimization (GRPO) to reinforce effective reasoning behaviors. Experiments on HR-Bench 4K show that sCoT improves overall reasoning performance by up to $1.9\%$ while using $\sim 75\%$ fewer GPU hours than strong baseline approaches. Code is available at https://github.com/YWenxi/think-with-images-through-self-calling.
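The self-calling pattern described above can be illustrated with a minimal sketch. This is not the paper's implementation: `shared_model` is a toy stand-in for the shared-parameter model, and `SelfCallingAgent`, `call_subagent`, and the `SUBTASK:` prompt convention are hypothetical names chosen for illustration. The key idea shown is that the main agent and its subagents invoke the same underlying model, but each subagent call starts from a fresh, isolated context, so the main trace stays language-only.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the shared model: the main agent and its
# "virtual replicas" (subagents) all call the same function, i.e. the
# same parameters are shared across calls.
def shared_model(prompt: str) -> str:
    # Toy behavior: answer a subtask by wrapping the question.
    if prompt.startswith("SUBTASK:"):
        return f"answer({prompt.removeprefix('SUBTASK:').strip()})"
    return prompt

@dataclass
class SelfCallingAgent:
    """Main agent that decomposes a task into atomic subtasks and
    invokes parameter-sharing subagents in isolated contexts."""
    trace: list = field(default_factory=list)  # language-only CoT trace

    def call_subagent(self, subtask: str) -> str:
        # A subagent is a virtual replica: same weights, empty context.
        # Only its textual answer flows back into the main trace.
        return shared_model(f"SUBTASK: {subtask}")

    def solve(self, task: str, subtasks: list[str]) -> str:
        self.trace.append(f"decompose: {task} -> {subtasks}")
        results = []
        for st in subtasks:
            out = self.call_subagent(st)  # isolated-context self-call
            self.trace.append(f"subcall: {st} -> {out}")
            results.append(out)
        # The main agent aggregates subagent outputs into a final answer.
        final = "; ".join(results)
        self.trace.append(f"aggregate -> {final}")
        return final

agent = SelfCallingAgent()
answer = agent.solve(
    "describe the scene",
    ["what is in region A?", "what is in region B?"],
)
print(answer)
```

In a real system the subtasks would typically be visual queries (e.g., crops of a high-resolution image), but the control flow shown here, decompose, self-call in isolation, aggregate, is the structural point.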

Wenxi Yang, Yuzhong Zhao, Fang Wan, Qixiang Ye • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Hallucination Evaluation | POPE | – | – | 132 |
| Optical Character Recognition Evaluation | OCRBench | Score | 0.845 | 46 |
| Visual Grounding | RefCOCO+ | Accuracy @ 0.5 IoU | 81.97 | 20 |
| Visual Grounding | RefCOCOg | Accuracy | 82.96 | 17 |
| Visual Reasoning | V* | Overall Score | 91.6 | 10 |
| Visual Reasoning | HR-Bench-4K | FSP | 0.933 | 7 |
| Visual Reasoning | HR-Bench-8K | FSP | 87 | 7 |

Other info

GitHub
