InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward

About

While vision-language models (VLMs) have exhibited multi-turn visual reasoning capabilities, their reasoning trajectories remain relatively shallow and are dominated by a text-centric paradigm, limiting their applicability to complex visual challenges. In contrast, human-like thought typically involves long-horizon reasoning with an interleaved visual-textual chain-of-thought (VT-CoT). To bridge this gap, we introduce InterSketch, an interleaved reasoning model that strengthens VT-CoT capability through self-correction and stepwise reward mechanisms. InterSketch dynamically generates intermediate visual sketches using external tools and interleaves them with textual reasoning, enabling effective perception and logical reasoning over long-horizon visual understanding tasks. Specifically, in the first cold-start stage, we synthesize a high-quality interleaved VT-CoT dataset and incorporate a reflection mechanism, equipping the model for multi-turn interleaved reasoning and self-correction. In the subsequent reinforcement learning (RL) stage, we design a stepwise reward mechanism to mitigate the sparsity of reward signals inherent in end-only supervision over long-horizon reasoning. Extensive experiments on visual reasoning benchmarks demonstrate the effectiveness of InterSketch, which even outperforms proprietary models such as Gemini-3-Pro.
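To make the contrast between end-only supervision and a stepwise reward concrete, the following is a minimal Python sketch. It is not the reward used by InterSketch, whose exact formulation is not given in this abstract. The function names, the per-step scores, the `step_weight` parameter, and the discount factor `gamma` are all illustrative assumptions.

```python
from typing import List


def end_only_rewards(num_steps: int, final_correct: bool) -> List[float]:
    """End-only supervision: every step gets 0 except the last one."""
    rewards = [0.0] * num_steps
    if num_steps:
        rewards[-1] = 1.0 if final_correct else 0.0
    return rewards


def stepwise_rewards(
    step_scores: List[float],
    final_correct: bool,
    step_weight: float = 0.5,  # hypothetical weight on intermediate steps
) -> List[float]:
    """Hypothetical stepwise scheme: each reasoning step earns a weighted
    score, and the final step additionally earns the outcome reward."""
    rewards = [step_weight * score for score in step_scores]
    if rewards:
        rewards[-1] += 1.0 if final_correct else 0.0
    return rewards


def discounted_returns(rewards: List[float], gamma: float = 0.9) -> List[float]:
    """Return-to-go per step: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # A three-step trajectory with two good intermediate steps
    # but a wrong final answer (scores are made up for illustration).
    scores = [1.0, 0.0, 1.0]
    print(discounted_returns(end_only_rewards(len(scores), final_correct=False)))
    print(discounted_returns(stepwise_rewards(scores, final_correct=False)))
```

Under these assumptions, a long trajectory that ends in a wrong answer yields an all-zero signal with end-only rewards, while the stepwise variant still credits the steps that scored well. That is the sparsity problem the abstract refers to.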

Zhiwei Ning, Wenwen Tong, Xiangli Kong, Shengnan Ma, Ziyi Shang, Jingcheng Ni, Tao Hu, Yong Xien Chng, Jixuan Ying, Zehuan Wu, Hanming Deng, Jie Yang, Yuanjie Zheng, Wei Liu, Lewei Lu • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Multimodal Understanding	MMStar	Accuracy	57.7	511
Visual Search	V*	Accuracy	83.8	53
Visual Search	HR-Bench-4K	Accuracy	78.5	37
Visual Search	HR-Bench-8K	Accuracy	77.8	29
Real-world Multimodal Understanding	RealworldQA	Accuracy	70.5	21
Visual Reasoning	TIR-Bench	Average Score	51.8	15
Visual Reasoning	UniMMMU Maze	Accuracy	12	7
Visual Reasoning	RealUnify	Accuracy	37.8	7
Visual Spatial Planning	VSP	Accuracy	73.2	7
Knowledge Reasoning	BLINK	Accuracy	63	5

Showing 10 of 12 rows
