
MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding

About

Vision-language models (VLMs) have achieved strong performance in multimodal understanding and reasoning, yet grounded reasoning in 3D scenes remains underexplored. Effective 3D reasoning hinges on accurate grounding: to answer open-ended queries, a model must first identify query-relevant objects and regions in a complex scene, and then reason about their spatial and geometric relationships. Recent approaches have demonstrated strong potential for grounded 3D reasoning. However, they often rely on in-domain tuning or hand-crafted reasoning pipelines, which limit their flexibility and zero-shot generalization to novel environments. In this work, we present MAG-3D, a training-free multi-agent framework for grounded 3D reasoning with off-the-shelf VLMs. Instead of relying on task-specific training or fixed reasoning procedures, MAG-3D dynamically coordinates expert agents to address the key challenges of 3D reasoning. Specifically, we propose a planning agent that decomposes the task and orchestrates the overall reasoning process, a grounding agent that performs free-form 3D grounding and relevant frame retrieval from extensive 3D scene observations, and a coding agent that conducts flexible geometric reasoning and explicit verification through executable programs. This multi-agent collaborative design enables flexible training-free 3D grounded reasoning across diverse scenes and achieves state-of-the-art performance on challenging benchmarks.
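To make the three-agent division of labor concrete, here is a minimal sketch of the coordination loop in plain Python. All names, interfaces, and the toy scene are hypothetical illustrations: the paper's actual agents are backed by off-the-shelf VLMs and emit executable programs, whereas this sketch hard-codes each role's behavior for a single spatial query.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the three MAG-3D roles. In the real system,
# each agent would be an off-the-shelf VLM with its own prompt; here each
# is a plain function so the control flow is easy to follow.

@dataclass
class Grounding:
    """Query-relevant objects: name -> 3D centroid (x, y, z)."""
    objects: dict

def planning_agent(query: str) -> list:
    # Decomposes the task and orders the subtasks (illustrative plan only).
    return ["ground_objects", "geometric_reasoning"]

def grounding_agent(scene: dict, query: str) -> Grounding:
    # Free-form grounding: keep only objects the query mentions.
    relevant = {name: pos for name, pos in scene.items() if name in query}
    return Grounding(objects=relevant)

def coding_agent(grounding: Grounding) -> str:
    # Geometric reasoning as an executable program: compare x-coordinates
    # of the two grounded objects and return the leftmost one.
    (name_a, pos_a), (name_b, pos_b) = sorted(grounding.objects.items())
    return name_a if pos_a[0] < pos_b[0] else name_b

def mag3d_answer(scene: dict, query: str) -> str:
    # The planner orchestrates; the other agents execute its plan in order.
    grounding, answer = None, None
    for step in planning_agent(query):
        if step == "ground_objects":
            grounding = grounding_agent(scene, query)
        elif step == "geometric_reasoning":
            answer = coding_agent(grounding)
    return answer

# Toy scene: two objects with known centroids.
scene = {"chair": (0.5, 1.0, 0.0), "table": (2.0, 1.0, 0.0)}
print(mag3d_answer(scene, "which is further left, the chair or the table?"))
```

The design point the sketch tries to capture is that no step is hand-wired: the planner decides which agents run and in what order, and the final spatial comparison is done by explicit, verifiable code rather than by a single end-to-end forward pass.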

Henry Zheng, Chenyue Fang, Rui Huang, Siyuan Wei, Xiao Liu, Gao Huang • 2026

Related benchmarks

Task                          Dataset            Result                Rank
3D Question Answering         MSQA               Count Accuracy: 44.1  25
3D Question Answering         Beacon3D           Case Score: 65        23
Grounding-QA Chain Analysis   Beacon3D           T1 Score: 19.2       6
3D Question Answering         MSQA Vision-only   Count Accuracy: 23.6  4
