
ViThinker: Active Vision-Language Reasoning via Dynamic Perceptual Querying

About

Chain-of-Thought (CoT) reasoning excels in language models but struggles in vision-language models due to premature visual-to-text conversion that discards continuous information such as geometry and spatial layout. While recent methods enhance CoT through static enumeration or attention-based selection, they remain passive, processing pre-computed inputs rather than actively seeking task-relevant details. Inspired by human active perception, we introduce ViThinker, a framework that enables vision-language models to autonomously generate decision (query) tokens that trigger the synthesis of expert-aligned visual features on demand. ViThinker internalizes vision-expert capabilities during training and performs generative mental simulation during inference without external tool calls. Through a two-stage curriculum, first distilling frozen experts into model parameters and then learning task-driven querying via sparsity penalties, ViThinker discovers the minimal sufficient perception for each reasoning step. Evaluations across vision-centric benchmarks demonstrate consistent improvements, validating that active query generation outperforms passive approaches in both perceptual grounding and reasoning accuracy.
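The abstract does not give implementation details, so the following is only a minimal PyTorch sketch of the mechanism it describes. Everything here is an illustrative assumption rather than ViThinker's actual code: the PerceptualQueryLayer module, the sigmoid gate standing in for the decision (query) token, the synthesizer head standing in for the internalized (distilled) vision expert, the dimensions, and the L1 sparsity weight lam are all hypothetical.

```python
# Hypothetical sketch of active perceptual querying with a two-stage
# curriculum. Not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PerceptualQueryLayer(nn.Module):
    """One reasoning step with an on-demand perceptual query.

    A gate derived from the hidden state plays the role of the decision
    (query) token; when it fires, a small synthesizer head (standing in
    for the internalized, distilled vision expert) produces visual
    features, which are fused back into the reasoning stream. There is
    no external tool call at inference time.
    """

    def __init__(self, d_model: int = 512, d_visual: int = 512):
        super().__init__()
        self.query_gate = nn.Linear(d_model, 1)
        self.synthesizer = nn.Sequential(
            nn.Linear(d_model, d_visual),
            nn.GELU(),
            nn.Linear(d_visual, d_visual),
        )
        self.fuse = nn.Linear(d_model + d_visual, d_model)

    def forward(self, h: torch.Tensor):
        # h: (batch, d_model) hidden state at the current reasoning step.
        gate = torch.sigmoid(self.query_gate(h))    # query probability in [0, 1]
        visual = self.synthesizer(h)                # expert-aligned features
        fused = self.fuse(torch.cat([h, gate * visual], dim=-1))
        return fused, gate, visual


def curriculum_loss(stage: int, task_loss: torch.Tensor, gate: torch.Tensor,
                    visual: torch.Tensor, expert_feats: torch.Tensor,
                    lam: float = 0.01) -> torch.Tensor:
    """Two-stage curriculum, per the abstract (details assumed).

    Stage 1: distill a frozen expert's features into the synthesizer.
    Stage 2: optimize the task loss while an L1 penalty on the gate
    encourages minimal sufficient querying (the sparsity penalty).
    """
    if stage == 1:
        return F.mse_loss(visual, expert_feats)
    return task_loss + lam * gate.abs().mean()


if __name__ == "__main__":
    layer = PerceptualQueryLayer()
    h = torch.randn(4, 512)
    fused, gate, visual = layer(h)
    expert_feats = torch.randn(4, 512)  # would come from a frozen vision expert
    loss = curriculum_loss(2, fused.pow(2).mean(), gate, visual, expert_feats)
    print(fused.shape, gate.squeeze(-1).shape, loss.item())
```

Under these assumptions, the L1 term on the gate is what drives "minimal sufficient perception": the model pays a small cost for every query it issues, so it learns to query only when the reasoning step needs it.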

Weihang You, Qingchan Zhu, David Liu, Yi Pan, Geng Yuan, Hanqi Jiang • 2026

Related benchmarks

Task             | Dataset                   | Result           | Rank
Visual Reasoning | Vision-Centric Benchmarks | BLINK Score 59.1 | 20
