Visual Agentic AI for Spatial Reasoning with a Dynamic API

About

Visual reasoning -- the ability to interpret the visual world -- is crucial for embodied agents that operate within three-dimensional scenes. Progress in AI has led to vision and language models capable of answering questions from images. However, their performance declines when tasked with 3D spatial reasoning. To tackle the complexity of such reasoning problems, we introduce an agentic program synthesis approach where LLM agents collaboratively generate a Pythonic API with new functions to solve common subproblems. Our method overcomes limitations of prior approaches that rely on a static, human-defined API, allowing it to handle a wider range of queries. To assess AI capabilities for 3D understanding, we introduce a new benchmark of queries involving multiple steps of grounding and inference. We show that our method outperforms prior zero-shot models for visual reasoning in 3D and empirically validate the effectiveness of our agentic framework for 3D spatial reasoning tasks. Project website: https://glab-caltech.github.io/vadar/
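The idea of an API that grows with new functions can be illustrated with a toy sketch. This is not the authors' implementation: the `DynamicAPI` class, the `locate` and `depth` primitives, and the hand-written `SCENE` are all hypothetical stand-ins. In the described method, LLM agents would propose the new functions and the primitives would be backed by vision models; here the "generated" code is written by hand as strings so the example stays self-contained.

```python
from typing import Callable, Dict


class DynamicAPI:
    """Toy registry that grows as new helper functions are proposed (hypothetical)."""

    def __init__(self, primitives: Dict[str, Callable]):
        # Start from a fixed set of primitives (stand-ins for vision modules).
        self.namespace: Dict[str, Callable] = dict(primitives)

    def add_function(self, source: str) -> None:
        # Execute the proposed source inside the shared namespace, so the new
        # function can call primitives and any previously added functions.
        exec(source, self.namespace)

    def run(self, program: str):
        # Run a program against a copy of the namespace and return whatever
        # it assigned to the variable `answer`.
        scope = dict(self.namespace)
        exec(program, scope)
        return scope["answer"]


# Hand-written scene: object name -> assumed 3D position (no image involved).
SCENE = {"chair": {"x": 1.0, "z": 2.0}, "table": {"x": -0.5, "z": 3.5}}


def locate(name):
    # Hypothetical primitive: look up an object in the toy scene.
    return SCENE[name]


def depth(obj):
    # Hypothetical primitive: distance from the camera along the z axis.
    return obj["z"]


api = DynamicAPI({"locate": locate, "depth": depth})

# A function an agent might propose to cover a recurring subproblem.
api.add_function('''
def is_behind(a, b):
    return depth(locate(a)) > depth(locate(b))
''')

# A program for the query "Is the table behind the chair?"
answer = api.run('answer = "yes" if is_behind("table", "chair") else "no"')
print(answer)
```

The point of the sketch is the contrast with a static API: `is_behind` is not part of the initial primitives, yet later programs can call it once it has been added. Executing generated code with `exec` is unsafe outside a sandbox, so a real system would need isolation that this sketch omits.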

Damiano Marsili, Rohun Agrawal, Yisong Yue, Georgia Gkioxari • 2025

Related benchmarks

Task	Dataset	Metric	Value	
Spatial Reasoning	MMSI-Bench	Average Accuracy	28.9	67
Spatial Reasoning	ViewSpatial-Bench	Overall Score	33.7	35
Spatial Reasoning	SpatialSense SpatialScore-Hard	Accuracy	40.8	16
3D Spatial Reasoning	Omni3D-Bench (test)	Yes/No Acc	56	11
Multi-view spatial reasoning	MINDCUBE-1k	Overall Accuracy	40.76	9
Single-view spatial reasoning	OMNI3D BENCH	Numeric Estimation (Count)	21.7	9
Spatial Reasoning	VG-B SpatialScore-Hard	Accuracy	39.1	8
Spatial Reasoning	3DSR-B SpatialScore-Hard	Accuracy	24.8	8
Spatial Reasoning	OMNI3D-BENCH 100 held out queries	Accuracy	38.9	5
Visual Question Answering	GQA VADAR	Accuracy	46.1	5

Showing 10 of 14 rows
