InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward
About
While vision-language models (VLMs) have exhibited multi-turn visual reasoning capabilities, their reasoning trajectories remain relatively shallow and are dominated by a text-centric paradigm, which limits their applicability to complex visual challenges. In contrast, human-like thought typically involves long-horizon reasoning with an interleaved visual-textual chain-of-thought (VT-CoT). To bridge this gap, we introduce InterSketch, an interleaved reasoning model that enhances VT-CoT capability through self-correction and stepwise reward mechanisms. InterSketch dynamically generates intermediate visual sketches using external tools and interleaves them with textual reasoning, enabling effective perception and logical reasoning on long-horizon visual understanding tasks. Specifically, in the cold-start stage, we synthesize a high-quality interleaved VT-CoT dataset and incorporate a reflection mechanism to equip the model with multi-turn interleaved reasoning and self-correction. In the subsequent reinforcement learning (RL) stage, we design a stepwise reward mechanism to mitigate the sparsity of reward signals inherent in end-only supervision over long-horizon reasoning. Extensive experiments on visual reasoning benchmarks demonstrate the effectiveness of InterSketch, which even outperforms proprietary models such as Gemini-3-Pro.
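The contrast between end-only supervision and a stepwise reward can be illustrated with a small sketch. This is not the InterSketch reward itself, which the abstract does not specify. The `Step` type, the `step_score` field, the `step_weight` parameter, and all three function names are hypothetical, chosen only to show why per-step signals are less sparse than a single terminal reward over a long trajectory.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    kind: str          # hypothetical label: "text" or "sketch"
    step_score: float  # hypothetical per-step quality score in [0, 1]


def end_only_rewards(steps: List[Step], final_correct: bool) -> List[float]:
    """Sparse signal: only the last step carries the outcome reward."""
    rewards = [0.0] * len(steps)
    if steps:
        rewards[-1] = 1.0 if final_correct else 0.0
    return rewards


def stepwise_rewards(
    steps: List[Step], final_correct: bool, step_weight: float = 0.5
) -> List[float]:
    """Dense signal: each step gets a weighted score, plus the outcome at the end.

    The weighting scheme is an assumption made for illustration.
    """
    rewards = [step_weight * s.step_score for s in steps]
    if steps:
        rewards[-1] += 1.0 if final_correct else 0.0
    return rewards


def discounted_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return-to-go for each step, accumulated from the end of the trajectory."""
    returns: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


if __name__ == "__main__":
    trajectory = [Step("text", 1.0), Step("sketch", 0.0), Step("text", 1.0)]
    print(end_only_rewards(trajectory, final_correct=True))
    print(stepwise_rewards(trajectory, final_correct=True))
    print(discounted_returns(stepwise_rewards(trajectory, final_correct=True)))
```

Under end-only rewards every step in a successful trajectory receives the same return, so a poor intermediate sketch is indistinguishable from a good one. With the per-step term, the returns differ across steps, which is the kind of credit assignment a stepwise reward is meant to provide.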
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Understanding | MMStar | Accuracy | 57.7 | 407 |
| Visual Search | HR-Bench-8K | Accuracy | 77.8 | 29 |
| Visual Search | HR-Bench-4K | Accuracy | 78.5 | 29 |
| Visual Search | V* | Accuracy | 83.8 | 28 |
| Real-world Multimodal Understanding | RealworldQA | Accuracy | 70.5 | 21 |
| Visual Reasoning | TIR-Bench | Average Score | 51.8 | 15 |
| Visual Reasoning | UniMMMU Maze | Accuracy | 12 | 7 |
| Visual Reasoning | RealUnify | Accuracy | 37.8 | 7 |
| Visual Spatial Planning | VSP | Accuracy | 73.2 | 7 |
| Knowledge Reasoning | BLINK | Accuracy | 63 | 5 |