Think3D: Thinking with Space for Spatial Reasoning
About
While contemporary Vision-Language Models (VLMs) excel at 2D visual understanding, they remain constrained by a passive, 2D-centric paradigm that severely limits genuine 3D spatial reasoning. To bridge this gap, we introduce Think3D, a novel framework that equips VLM agents with interactive, 3D chain-of-thought reasoning capabilities. By integrating a suite of 3D manipulation tools, Think3D transforms passive perception into active spatial exploration, closely mirroring human geometric reasoning. We demonstrate that Think3D acts as a highly effective zero-shot plug-in for state-of-the-art closed-source models (e.g., GPT-4.1, Gemini 2.5 Pro), yielding absolute performance gains of +7.8% on BLINK Multi-view and MindCube, and +4.7% on VSI-Bench. Furthermore, to optimize tool-use in smaller open-weight models, we propose Think3D-RL, a reinforcement learning paradigm designed to autonomously learn spatial exploration strategies. When applied to Qwen3-VL-4B, Think3D-RL amplifies the performance gain from a marginal +0.7% to a substantial +10.7%. Notably, this RL formulation induces an exploration policy that qualitatively aligns with the sophisticated behavior of much larger models, entirely circumventing the need for costly operation-trajectory annotations. Ultimately, Think3D establishes tool-augmented active exploration as an effective paradigm for unlocking human-like 3D reasoning in multimodal agents. Code, models, and data are available at https://github.com/zhangzaibin/spagent.
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Spatial Reasoning | VSI-Bench tiny | Avg Score | 51.61 | 39 |
| Spatial Reasoning | BLINK Multi-view (test) | Accuracy | 63.91 | 15 |
| Spatial Reasoning | MindCube Subset (test) | Rotation Score | 86.67 | 15 |
| Visual Spatial Inference | VSI-Bench Tiny video-input | Object Count Score | 45.8 | 12 |
| Spatial Reasoning | VLM4D-Real Ego-centric | Accuracy | 72 | 11 |
| Spatial Reasoning | VLM4D-Real Exo-centric | Accuracy | 70 | 11 |
| Multi-view spatial reasoning | MindCube de-biased Tiny (test) | Among Score | 54 | 11 |
| Spatial Reasoning | VLM4D-Real Overall | Accuracy (%) | 69 | 11 |
| Active Spatial Understanding | S3-Eval (simulation) | Overall Score | 44 | 4 |
| Active Vision Spatial Understanding | S3-Eval real | Overall Score | 42 | 4 |