
Token Warping Helps MLLMs Look from Nearby Viewpoints

About

Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, and the natural remedy of pixel-wise warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping, finding that backward token warping, which defines a dense grid on the target view and retrieves a corresponding source-view token for each grid point, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines, including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.
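To make the backward-warping idea concrete, below is a minimal, hypothetical sketch (not the authors' released code). It assumes source-view ViT tokens reshaped into a [C, H, W] feature map, a target-view depth map at token resolution, shared intrinsics K, and a relative pose (R, t) mapping target-view coordinates into the source camera frame; the function name and all inputs are illustrative assumptions. Each target-view grid point is unprojected with depth, reprojected into the source view, and its token is retrieved by bilinear sampling.

```python
# Hypothetical sketch of backward token warping; all names and inputs are
# illustrative assumptions, not the paper's actual implementation.
import torch
import torch.nn.functional as F

def backward_warp_tokens(src_tokens, tgt_depth, K, R, t):
    """For each target-view grid point, fetch the corresponding source token.

    src_tokens: [C, Ht, Wt] source-view token map (ViT tokens reshaped to 2D)
    tgt_depth:  [Ht, Wt] target-view depth at token resolution
    K:          [3, 3] camera intrinsics at token resolution
    R, t:       [3, 3], [3] relative pose: x_src = R @ x_tgt + t
    """
    C, Ht, Wt = src_tokens.shape

    # Dense grid on the TARGET view: one homogeneous pixel per token.
    ys, xs = torch.meshgrid(
        torch.arange(Ht, dtype=torch.float32),
        torch.arange(Wt, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)

    # Unproject target pixels to 3D with target-view depth, move them into
    # the source camera frame, and reproject with the pinhole model.
    pts_tgt = torch.linalg.inv(K) @ pix * tgt_depth.reshape(1, -1)
    pts_src = R @ pts_tgt + t.reshape(3, 1)
    proj = K @ pts_src
    uv = proj[:2] / proj[2:3].clamp(min=1e-6)  # [2, Ht*Wt]

    # Normalize to [-1, 1] for grid_sample; points that project behind the
    # camera or outside the source image land off-grid and are zero-padded.
    u = uv[0].reshape(Ht, Wt) / (Wt - 1) * 2 - 1
    v = uv[1].reshape(Ht, Wt) / (Ht - 1) * 2 - 1
    grid = torch.stack([u, v], dim=-1).unsqueeze(0)  # [1, Ht, Wt, 2]
    warped = F.grid_sample(
        src_tokens.unsqueeze(0), grid,
        mode="bilinear", padding_mode="zeros", align_corners=True,
    )
    return warped.squeeze(0)  # [C, Ht, Wt] target-view token map
```

One reason this direction is more stable than forward warping: because the dense grid is defined on the target view, every output token receives a value by interpolation, whereas forward-splatting source tokens into the target view can leave holes and collisions.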

Phillip Y. Lee, Chanho Park, Mingue Park, Seungwoo Yoo, Juil Koo, Minhyuk Sung • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
View-Conditioned Spatial Reasoning | ViewBench-Text | Accuracy | 81.73 | 48
View-Conditioned Spatial Reasoning | ViewBench-Shape | Accuracy | 75.72 | 48
Target-view object description | ViewBench-Object | Score (1–10) | 6.29 | 36
Spatial Reasoning | ViewBench-Text (5–15% overlap) | Accuracy | 0.7789 | 31
Spatial Reasoning | ViewBench-Shape (5–15% overlap) | Accuracy | 67.44 | 10
Target-view object description | ViewBench-Object (5–15% overlap) | Object Description Score | 5.18 | 6

Other info

GitHub
