Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification

About

Multi-modal large language models (MLLMs) have achieved remarkable capabilities by integrating visual perception with language understanding, enabling applications such as image-grounded dialogue, visual question answering, and scientific analysis. However, most MLLMs adopt a static inference paradigm, encoding the entire image into fixed visual tokens upfront, which limits their ability to iteratively refine understanding or adapt to context during inference. This contrasts sharply with human perception, which is dynamic, selective, and feedback-driven. In this work, we introduce a novel framework for inference-time visual token scaling that enables MLLMs to perform iterative, verifier-guided reasoning over visual content. We formulate the problem as a Markov Decision Process involving a reasoner that proposes visual actions and a verifier, trained via multi-step Direct Preference Optimization (DPO), that evaluates these actions and determines when reasoning should terminate. To support this, we present a new dataset, VTS, comprising supervised reasoning trajectories (VTS-SFT) and preference-labeled reasoning comparisons (VTS-DPO). Our method significantly outperforms existing approaches across diverse visual reasoning benchmarks, offering not only improved accuracy but also more interpretable and grounded reasoning processes. These results demonstrate the promise of dynamic inference mechanisms for enabling fine-grained, context-aware visual reasoning in next-generation MLLMs.
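The reasoner/verifier loop described in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the names `State`, `run_episode`, `stop_threshold`, and `max_steps` are hypothetical, the reasoner, verifier, and executor are toy stand-ins for the actual models and visual tools, and the rule "stop when the verifier score drops below a threshold" is only one plausible reading of "determines when reasoning should terminate".

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class State:
    """Hypothetical MDP state: the question plus everything gathered so far."""
    question: str
    visual_tokens: List[str] = field(default_factory=list)  # accumulated visual evidence
    steps: List[str] = field(default_factory=list)          # accepted visual actions


# Assumed interfaces (not taken from the paper):
Reasoner = Callable[[State], str]         # proposes the next visual action
Verifier = Callable[[State, str], float]  # scores a proposed action
Executor = Callable[[str], List[str]]     # runs an action, returns new visual tokens


def run_episode(question: str, reasoner: Reasoner, verifier: Verifier,
                executor: Executor, stop_threshold: float = 0.5,
                max_steps: int = 8) -> State:
    """Iterate propose -> verify -> execute until the verifier says stop."""
    state = State(question)
    for _ in range(max_steps):
        action = reasoner(state)
        # Assumed stopping rule: a low verifier score ends the episode.
        if verifier(state, action) < stop_threshold:
            break
        # Accepted actions add visual tokens, so the token budget grows per step.
        state.visual_tokens.extend(executor(action))
        state.steps.append(action)
    return state


# Toy stand-ins so the sketch runs end to end.
def toy_reasoner(state: State) -> str:
    return f"crop_{len(state.steps)}"


def toy_verifier(state: State, action: str) -> float:
    return 1.0 - 0.3 * len(state.steps)  # confidence decays with each step


def toy_executor(action: str) -> List[str]:
    return [f"tok<{action}>"]


final = run_episode("How many cups are on the table?",
                    toy_reasoner, toy_verifier, toy_executor)
print(final.steps, final.visual_tokens)
```

With these toy components the verifier accepts the first two proposed crops and rejects the third, so the episode ends after two steps. In the framework itself the verifier is a learned model trained with multi-step DPO rather than a fixed decay rule.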

Tianyi Bai, Zengjie Hu, Fupeng Sun, Jiantao Qiu, Yizhen Jiang, Guangxin He, Bohan Zeng, Conghui He, Binhang Yuan, Wentao Zhang • 2025

Related benchmarks

Task	Dataset	Result
Optical Character Recognition	OCRBench	--	486
High-Resolution Visual Perception	HR-Bench-4K	Accuracy 69.8	79
Visual Reasoning	V*	Accuracy 74.9	72
Counting	TallyQA	Accuracy 72.8	67
High-Resolution Visual Perception	HR-Bench-8K	Accuracy 67.3	63
Visual Reasoning	BLINK	Jigsaw Accuracy 56.1	57
Image Reasoning	WeMath	Accuracy 42.8	34
Multimodal Mathematical Reasoning	MathVision	Pass@1 Accuracy 27	31
Visual Reasoning	VLMs are Blind	Accuracy 42.1	28
Medical Visual Question Answering	Slake	Accuracy 57.9	25

Showing 10 of 16 rows
