Think3D: Thinking with Space for Spatial Reasoning

About

While contemporary Vision-Language Models (VLMs) excel at 2D visual understanding, they remain constrained by a passive, 2D-centric paradigm that severely limits genuine 3D spatial reasoning. To bridge this gap, we introduce Think3D, a novel framework that equips VLM agents with interactive, 3D chain-of-thought reasoning capabilities. By integrating a suite of 3D manipulation tools, Think3D transforms passive perception into active spatial exploration, closely mirroring human geometric reasoning. We demonstrate that Think3D acts as a highly effective zero-shot plug-in for state-of-the-art closed-source models (e.g., GPT-4.1, Gemini 2.5 Pro), yielding absolute performance gains of +7.8% on BLINK Multi-view and MindCube, and +4.7% on VSI-Bench. Furthermore, to optimize tool-use in smaller open-weight models, we propose Think3D-RL, a reinforcement learning paradigm designed to autonomously learn spatial exploration strategies. When applied to Qwen3-VL-4B, Think3D-RL amplifies the performance gain from a marginal +0.7% to a substantial +10.7%. Notably, this RL formulation induces an exploration policy that qualitatively aligns with the sophisticated behavior of much larger models, entirely circumventing the need for costly operation-trajectory annotations. Ultimately, Think3D establishes tool-augmented active exploration as an effective paradigm for unlocking human-like 3D reasoning in multimodal agents. Code, models, and data are available at https://github.com/zhangzaibin/spagent.
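The idea of turning passive perception into active exploration through tool calls can be illustrated with a toy loop. The sketch below is not Think3D's implementation and does not use the spagent codebase. Every name in it (`Scene`, `rotate_view`, `describe`, `explore`, `scripted_policy`) is hypothetical, the "scene" is just a set of labeled 3D points, and a scripted function stands in for the VLM that would choose the next tool call.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional

Point = tuple[float, float, float]
Action = tuple[str, float]  # (tool name, argument); hypothetical format


@dataclass
class Scene:
    # Object name -> position in camera coordinates (x right, y up, z forward).
    objects: dict[str, Point]


def rotate_view(scene: Scene, degrees: float) -> Scene:
    """Hypothetical tool: rotate the scene about the vertical (y) axis."""
    a = math.radians(degrees)
    c, s = math.cos(a), math.sin(a)
    return Scene({
        name: (c * x + s * z, y, -s * x + c * z)
        for name, (x, y, z) in scene.objects.items()
    })


# Registry of tools the agent may call (only one in this toy example).
TOOLS: dict[str, Callable[[Scene, float], Scene]] = {"rotate_view": rotate_view}


def describe(scene: Scene) -> str:
    """Stand-in for a rendered view: object names ordered left to right."""
    ordered = sorted(scene.objects, key=lambda name: scene.objects[name][0])
    return ", ".join(ordered)


def explore(
    scene: Scene,
    policy: Callable[[list[str]], Optional[Action]],
    max_steps: int = 4,
) -> list[str]:
    """Alternate between observing and acting until the policy stops."""
    history = [describe(scene)]
    for _ in range(max_steps):
        action = policy(history)
        if action is None:  # policy decides it has seen enough
            break
        tool_name, arg = action
        scene = TOOLS[tool_name](scene, arg)
        history.append(describe(scene))
    return history


def scripted_policy(history: list[str]) -> Optional[Action]:
    """Placeholder for the model: look from the opposite side once, then stop."""
    return ("rotate_view", 180.0) if len(history) == 1 else None


if __name__ == "__main__":
    scene = Scene({"chair": (-1.0, 0.0, 2.0), "table": (1.0, 0.0, 2.0)})
    print(explore(scene, scripted_policy))
```

In this toy setup the left-to-right order of the two objects flips after the 180-degree rotation, which is the kind of evidence a single fixed view cannot provide. A learned policy, as in the RL setting the abstract describes, would replace `scripted_policy`.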

Zaibin Zhang, Yuhan Wu, Lianjie Jia, Yifan Wang, Zhongbo Zhang, Yijiang Li, Binghao Ran, Fuxi Zhang, Zhuohan Sun, Zhenfei Yin, Lijun Wang, Huchuan Lu • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Spatial Reasoning | VSI-Bench tiny | Avg Score: 51.61 | 39 |
| Spatial Reasoning | BLINK Multi-view (test) | Accuracy: 63.91 | 15 |
| Spatial Reasoning | MindCube Subset (test) | Rotation Score: 86.67 | 15 |
| Visual Spatial Inference | VSI-Bench Tiny video-input | Object Count Score: 45.8 | 12 |
| Spatial Reasoning | VLM4D-Real Ego-centric | Accuracy: 72 | 11 |
| Spatial Reasoning | VLM4D-Real Exo-centric | Accuracy: 70 | 11 |
| Multi-view spatial reasoning | MindCube de-biased Tiny (test) | Among Score: 54 | 11 |
| Spatial Reasoning | VLM4D-Real Overall | Accuracy (%): 69 | 11 |
| Active Spatial Understanding | S3-Eval (simulation) | Overall Score: 44 | 4 |
| Active Vision Spatial Understanding | S3-Eval real | Overall Score: 42 | 4 |
