SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding

About

3D Visual Grounding (3DVG) aims to localize target objects within a 3D scene based on natural language queries. To alleviate the reliance on costly 3D training data, recent studies have explored zero-shot 3DVG by leveraging the extensive knowledge and powerful reasoning capabilities of pre-trained LLMs and VLMs. However, existing paradigms tend to emphasize either spatial (3D-based) or semantic (2D-based) understanding, limiting their effectiveness in complex real-world applications. In this work, we introduce SPAZER, a VLM-driven agent that combines both modalities in a progressive reasoning framework. It first holistically analyzes the scene and produces a 3D rendering from the optimal viewpoint. Based on this rendering, it conducts anchor-guided candidate screening to coarsely localize potential objects. Finally, it retrieves relevant 2D camera images and performs 3D-2D joint decision-making to determine the best-matching object. By bridging spatial and semantic reasoning neural streams, SPAZER achieves robust zero-shot grounding without training on 3D-labeled data. Extensive experiments on the ScanRefer and Nr3D benchmarks demonstrate that SPAZER significantly outperforms previous state-of-the-art zero-shot methods, achieving notable accuracy gains of 9.0% and 10.9%, respectively.

Zhao Jin, Rong-Cheng Tu, Jingyi Liao, Wenhao Sun, Xiao Luo, Shunyu Liu, Dacheng Tao • 2025

Related benchmarks

Task	Dataset	Metric	Result	
3D Visual Grounding	ScanRefer (val)	Overall Accuracy @ IoU 0.50	48.8	253
3D Visual Grounding	ScanRefer	Acc@0.5	43.4	142
3D Visual Grounding	Nr3D	Overall Success Rate	63.8	97
3D Visual Grounding	Nr3D (test)	Overall Success Rate	56	88
3D Visual Grounding	ScanRefer Overall	Acc @ 0.25	57.2	41
3D Visual Grounding	ScanRefer Unique	Acc@0.25 (IoU=0.25)	80.9	41
3D Visual Grounding	ScanRefer Multiple	Accuracy @ IoU=0.25	51.7	17
3D Visual Grounding	Nr3D without GT object class	Easy Success	68	13
3D box localization	ScanRefer	Accuracy @ 0.25 IoU	57.2	11
3D Visual Grounding	ScanRefer 250 scenes (test)	Acc@0.25 (Unique)	80.9	7
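
The Acc@0.25 and Acc@0.5 rows above report the percentage of queries whose predicted 3D box overlaps the ground-truth box with an IoU at or above the threshold. The following is a minimal Python sketch of that metric, not the benchmarks' official evaluation code. It assumes axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax) and counts a prediction as correct when IoU >= threshold; the function names iou_3d and accuracy_at_iou are hypothetical.

```python
from typing import Sequence

# Assumed box layout: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Sequence[float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box; degenerate boxes have volume 0."""
    dx = max(0.0, box[3] - box[0])
    dy = max(0.0, box[4] - box[1])
    dz = max(0.0, box[5] - box[2])
    return dx * dy * dz


def iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        low = max(a[axis], b[axis])
        high = min(a[axis + 3], b[axis + 3])
        if high <= low:
            return 0.0  # no overlap along this axis means no overlap at all
        intersection *= high - low
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union if union > 0 else 0.0


def accuracy_at_iou(predictions: Sequence[Box],
                    ground_truths: Sequence[Box],
                    threshold: float) -> float:
    """Percentage of predictions with IoU >= threshold (assumed convention)."""
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must have equal length")
    if not predictions:
        return 0.0
    hits = sum(iou_3d(p, g) >= threshold
               for p, g in zip(predictions, ground_truths))
    return 100.0 * hits / len(predictions)
```

For example, a prediction shifted by half the box width along one axis has an IoU of 1/3 with the ground truth, so it counts as correct at the 0.25 threshold but not at 0.5. The Nr3D success-rate rows measure whether the correct object is selected and are not covered by this sketch.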
