
Chatting with Images for Introspective Visual Thinking

About

Current large vision-language models (LVLMs) typically rely on text-only reasoning over a single-pass visual encoding, which often loses fine-grained visual information. The recently proposed "thinking with images" paradigm attempts to alleviate this limitation by manipulating images via external tools or code; however, the resulting visual states are often insufficiently grounded in linguistic semantics, impairing effective cross-modal alignment, particularly when visual semantics or geometric relationships must be reasoned over across distant regions or multiple images. To address these challenges, we propose "chatting with images", a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions, enabling tighter coupling between linguistic reasoning and visual state updates. We instantiate this paradigm in ViLaVT, a novel LVLM equipped with a dynamic vision encoder explicitly designed for such interactive visual reasoning, and train it with a two-stage curriculum combining supervised fine-tuning and reinforcement learning to promote effective reasoning behaviors. Extensive experiments across eight benchmarks demonstrate that ViLaVT achieves strong and consistent improvements, with particularly pronounced gains on complex multi-image and video-based spatial reasoning tasks.

Junfei Wu, Jian Guan, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, Tieniu Tan • 2026
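
As a rough intuition for the "language-guided feature modulation" described in the abstract, the following is a minimal toy sketch, not ViLaVT's actual encoder. It assumes a prompt embedding and a flat list of region feature vectors (possibly drawn from several images), scores all regions jointly against the prompt, and rescales each region by its relevance. The function names `modulate_regions`, `softmax`, and `dot` are hypothetical and introduced only for illustration.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def modulate_regions(prompt: Vector, regions: List[Vector]) -> List[Vector]:
    """Rescale region features by their relevance to a prompt embedding.

    Assumption (illustrative only): regions from all images are scored
    jointly, so the relevance weights are normalized across every region.
    """
    scale = math.sqrt(len(prompt))
    weights = softmax([dot(prompt, r) / scale for r in regions])
    n = len(regions)
    # Multiply by n so that uniformly relevant regions are left unchanged.
    return [[w * n * x for x in r] for w, r in zip(weights, regions)]


if __name__ == "__main__":
    prompt = [1.0, 0.0]
    regions = [[1.0, 0.0], [0.0, 1.0]]  # e.g. one region from each of two images
    for feats in modulate_regions(prompt, regions):
        print(feats)
```

In this toy version the region aligned with the prompt is amplified and the other is attenuated; a learned model would replace the fixed dot-product scoring with trained projections.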

Related benchmarks

Task | Dataset | Metric | Result | Rank
Visual Question Answering | HRBench 4K | Accuracy | 0.755 | 54
Visual Question Answering | HRBench-8K | Accuracy | 69.3 | 51
Spatial Reasoning (Multi-Image) | MMSI-Bench | Accuracy | 31.3 | 29
Spatial Reasoning (Video) | VSI-Bench | Accuracy | 52 | 14
Spatial Reasoning (Multi-Image) | SPAR-Bench | Accuracy | 52.6 | 13
Spatial Reasoning (Single-Image) | SpatialEval Real | Accuracy | 68.9 | 10
Spatial Reasoning (Single-Image) | EmbSpatial | Accuracy | 69.3 | 10
Spatial Reasoning (Multi-Image) | ERQA | Accuracy | 42.2 | 8
