
pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning

About

Multi-modal Large Language Models (MLLMs) have demonstrated strong capabilities in general-purpose perception and reasoning, but they still struggle with tasks that require spatial understanding of the 3D world. To address this, we introduce pySpatial, a visual programming framework that equips MLLMs with the ability to interface with spatial tools via Python code generation. Given an image sequence and a natural-language query, the model composes function calls to spatial tools such as 3D reconstruction, camera-pose recovery, and novel-view rendering. These operations convert raw 2D inputs into an explorable 3D scene, enabling MLLMs to reason explicitly over structured spatial representations. Notably, pySpatial requires no gradient-based fine-tuning and operates in a fully zero-shot setting. Evaluations on the challenging MindCube and Omni3D-Bench benchmarks show that pySpatial consistently surpasses strong MLLM baselines; for instance, it outperforms GPT-4.1-mini by 12.94% on MindCube. Furthermore, in real-world indoor navigation experiments, a robot successfully traverses complex environments using route plans generated by pySpatial, highlighting the practical effectiveness of our approach.
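To make the idea concrete, here is a minimal sketch of the kind of "visual program" an MLLM might emit under this framework. The tool names, signatures, and stubbed return values below are illustrative assumptions, not the actual pySpatial API: a real tool call would run a 3D-reconstruction backend, whereas the stubs here return fixed dummy geometry so the program is self-contained.

```python
from dataclasses import dataclass
from typing import List
import math

# Stub spatial-tool interface (hypothetical; the real pySpatial tool
# names and signatures are not specified in this abstract).

@dataclass
class Object3D:
    name: str
    center: tuple  # (x, y, z) position in metres

def reconstruct_scene(images: List[str]) -> List[Object3D]:
    """Stand-in for a 3D-reconstruction tool: maps an image sequence to
    objects with estimated 3D centers. Returns fixed dummy values here."""
    return [
        Object3D("chair", (1.0, 0.0, 2.0)),
        Object3D("table", (3.0, 0.0, 2.0)),
    ]

def distance(a: Object3D, b: Object3D) -> float:
    """Euclidean distance between two object centers."""
    return math.dist(a.center, b.center)

# A program the model might compose for the query:
# "How far is the chair from the table?"
scene = reconstruct_scene(["frame_0.png", "frame_1.png"])
objs = {o.name: o for o in scene}
answer = distance(objs["chair"], objs["table"])
print(f"Estimated chair-table distance: {answer:.1f} m")
```

The point of the pattern is that the model answers the spatial query by composing and executing tool calls over an explicit 3D representation, rather than reasoning purely over 2D pixels.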

Zhanpeng Luo, Ce Zhang, Silong Yong, Cunxi Dai, Qianwei Wang, Haoxi Ran, Guanya Shi, Katia Sycara, Yaqi Xie • 2026

Related benchmarks

Task                          | Dataset        | Metric                     | Result | Rank
Spatial reasoning             | MMSI-Bench     | Average Accuracy           | 37.3   | 32
Multi-view spatial reasoning  | MindCube (full)| Overall Accuracy           | 58.56  | 18
Multi-view spatial reasoning  | MindCube-1k    | Overall Accuracy           | 62.35  | 9
Single-view spatial reasoning | Omni3D-Bench   | Numeric Estimation (Count) | 22.9   | 9
