Chatting with Images for Introspective Visual Thinking

About

Current large vision-language models (LVLMs) typically rely on text-only reasoning over a single-pass visual encoding, which often loses fine-grained visual information. The recently proposed "thinking with images" paradigm attempts to alleviate this limitation by manipulating images via external tools or code; however, the resulting visual states are often insufficiently grounded in linguistic semantics, impairing effective cross-modal alignment, particularly when visual semantics or geometric relationships must be reasoned over across distant regions or multiple images. To address these challenges, we propose "chatting with images", a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions, enabling tighter coupling between linguistic reasoning and visual state updates. We instantiate this paradigm in ViLaVT, a novel LVLM equipped with a dynamic vision encoder explicitly designed for such interactive visual reasoning, and train it with a two-stage curriculum combining supervised fine-tuning and reinforcement learning to promote effective reasoning behaviors. Extensive experiments across eight benchmarks demonstrate that ViLaVT achieves strong and consistent improvements, with particularly pronounced gains on complex multi-image and video-based spatial reasoning tasks.

Junfei Wu, Jian Guan, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, Tieniu Tan • 2026

Related benchmarks

Task	Dataset	Result
Visual Question Answering	HRBench 4K	Accuracy 0.755	61
Visual Question Answering	HRBench-8K	Accuracy 69.3	51
Spatial Reasoning (Multi-Image)	MMSI-Bench	Accuracy 31.3	44
Spatial Reasoning (Multi-Image)	SPAR-Bench	Accuracy 52.6	31
Spatial Reasoning (Video)	VSI-Bench	Accuracy 52	30
Spatial Reasoning (Multi-Image)	ERQA	Accuracy 42.2	23
Spatial Reasoning (Single-Image)	SpatialEval Real	Accuracy 68.9	10
Spatial Reasoning (Single-Image)	EmbSpatial	Accuracy 69.3	10
