Token Warping Helps MLLMs Look from Nearby Viewpoints
About
Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, and the natural remedy of pixel-wise warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping and find that backward token warping, which defines a dense grid on the target view and retrieves a corresponding source-view token for each grid point, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines, including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.
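The backward token warping described above maps naturally to a short sketch. The code below is a minimal illustration, not the paper's implementation: it assumes a target-view depth map already aligned to the token grid, camera intrinsics `K` rescaled to token resolution, and a known relative pose `T_tgt_to_src`; the function name and the bilinear `grid_sample` retrieval are likewise illustrative choices, not details confirmed by the source.

```python
import torch
import torch.nn.functional as F


def backward_token_warp(src_tokens, tgt_depth, K, T_tgt_to_src):
    """Hypothetical sketch of backward token warping.

    src_tokens:   (C, Ht, Wt) grid of source-view ViT tokens
    tgt_depth:    (Hg, Wg) depth sampled at the target-view token grid
    K:            (3, 3) intrinsics, scaled to token-grid resolution
    T_tgt_to_src: (4, 4) pose mapping target camera coords to source
    returns:      (C, Hg, Wg) tokens retrieved from the source view
    """
    C, Ht, Wt = src_tokens.shape
    Hg, Wg = tgt_depth.shape

    # 1. Dense grid of target-view token centers (pixel coordinates).
    ys, xs = torch.meshgrid(
        torch.arange(Hg, dtype=torch.float32),
        torch.arange(Wg, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)

    # 2. Unproject each grid point into target camera space via its depth.
    cam_tgt = torch.linalg.inv(K) @ pix * tgt_depth.reshape(1, -1)  # (3, N)

    # 3. Move the 3D points into the source camera frame.
    cam_tgt_h = torch.cat([cam_tgt, torch.ones(1, cam_tgt.shape[1])], dim=0)
    cam_src = (T_tgt_to_src @ cam_tgt_h)[:3]                        # (3, N)

    # 4. Project onto the source token grid (points assumed in front
    #    of the camera; clamp guards against division by ~zero depth).
    proj = K @ cam_src
    uv = proj[:2] / proj[2:].clamp(min=1e-6)                        # (2, N)

    # 5. Normalize to [-1, 1] and bilinearly sample source tokens;
    #    grid points that fall outside the source view get zero tokens.
    u = uv[0] / (Wt - 1) * 2 - 1
    v = uv[1] / (Ht - 1) * 2 - 1
    grid = torch.stack([u, v], dim=-1).reshape(1, Hg, Wg, 2)
    warped = F.grid_sample(
        src_tokens.unsqueeze(0), grid,
        mode="bilinear", padding_mode="zeros", align_corners=True,
    )
    return warped.squeeze(0)                                        # (C, Hg, Wg)
```

Because every target grid point retrieves its own token, backward warping leaves no holes and creates no many-to-one collisions, unlike forward splatting of source tokens into the target view; this is consistent with the stability and semantic coherence the abstract reports.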
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| View-conditioned spatial reasoning | ViewBench-Text | Accuracy (%) | 81.73 | 48 |
| View-conditioned spatial reasoning | ViewBench-Shape | Accuracy (%) | 75.72 | 48 |
| Target-view object description | ViewBench-Object | Score (1-10) | 6.29 | 36 |
| Spatial reasoning | ViewBench-Text (5-15% overlap) | Accuracy (%) | 77.89 | 31 |
| Spatial reasoning | ViewBench-Shape (5-15% overlap) | Accuracy (%) | 67.44 | 10 |
| Target-view object description | ViewBench-Object (5-15% overlap) | Score (1-10) | 5.18 | 6 |