
SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding

About

3D Visual Grounding (3DVG) aims to localize target objects within a 3D scene based on natural language queries. To alleviate the reliance on costly 3D training data, recent studies have explored zero-shot 3DVG by leveraging the extensive knowledge and powerful reasoning capabilities of pre-trained LLMs and VLMs. However, existing paradigms tend to emphasize either spatial (3D-based) or semantic (2D-based) understanding, limiting their effectiveness in complex real-world applications. In this work, we introduce SPAZER, a VLM-driven agent that combines both modalities in a progressive reasoning framework. It first holistically analyzes the scene and produces a 3D rendering from the optimal viewpoint. Based on this, anchor-guided candidate screening is conducted to perform coarse-level localization of potential objects. Finally, leveraging retrieved relevant 2D camera images, 3D-2D joint decision-making is performed to determine the best-matching object. By bridging spatial and semantic reasoning streams, SPAZER achieves robust zero-shot grounding without training on 3D-labeled data. Extensive experiments on the ScanRefer and Nr3D benchmarks demonstrate that SPAZER significantly outperforms previous state-of-the-art zero-shot methods, achieving accuracy gains of 9.0% and 10.9%, respectively.
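The three-stage pipeline described above (viewpoint selection, anchor-guided screening, 3D-2D joint decision) can be sketched in code. This is a hedged illustration only: all function names, data fields, and the stubbed scoring logic below are hypothetical stand-ins for the VLM calls the actual agent would make, not the authors' implementation.

```python
# Illustrative sketch of a SPAZER-style progressive reasoning loop.
# Every name and heuristic here is a placeholder; a real agent would
# delegate each stage to a pre-trained VLM instead of these stubs.

from dataclasses import dataclass


@dataclass
class Candidate:
    object_id: int
    label: str
    score: float  # stand-in for a VLM confidence score


def select_viewpoint(scene):
    """Stage 1 (stub): holistic scene analysis picks a rendering viewpoint.
    Here we simply take the view with the greatest scene coverage."""
    return max(scene["views"], key=lambda v: v["coverage"])


def screen_candidates(scene, query):
    """Stage 2 (stub): anchor-guided coarse screening.
    Keep objects whose label matches an anchor noun found in the query."""
    anchors = {w for w in query.split() if w in scene["labels"]}
    return [Candidate(o["id"], o["label"], o["score"])
            for o in scene["objects"] if o["label"] in anchors]


def joint_decision(candidates, images_2d):
    """Stage 3 (stub): 3D-2D joint decision.
    Boost candidates that also appear in retrieved 2D camera images."""
    visible = {img["object_id"] for img in images_2d}
    return max(candidates,
               key=lambda c: c.score + (0.5 if c.object_id in visible else 0.0))


def ground(scene, query, images_2d):
    """Run the three stages in sequence and return the chosen object."""
    select_viewpoint(scene)  # the rendered view would condition stage 2
    candidates = screen_candidates(scene, query)
    return joint_decision(candidates, images_2d)
```

With a toy scene containing two chairs and a table, a query like "the chair near the window" first narrows the candidates to the chairs (stage 2), then the 2D evidence breaks the tie (stage 3).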

Zhao Jin, Rong-Cheng Tu, Jingyi Liao, Wenhao Sun, Xiao Luo, Shunyu Liu, Dacheng Tao • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| 3D Visual Grounding | Nr3D (test) | Overall Success Rate | 56 | 88 |
| 3D Visual Grounding | Nr3D | Overall Success Rate | 63.8 | 74 |
| 3D Visual Grounding | ScanRefer Unique | Acc@0.25 (IoU=0.25) | 80.9 | 24 |
| 3D Visual Grounding | ScanRefer | Acc@0.25 | 51.7 | 23 |
| 3D Visual Grounding | ScanRefer Overall | Acc@0.25 | 57.2 | 17 |
| 3D Visual Grounding | ScanRefer 250 scenes (test) | Acc@0.25 (Unique) | 80.9 | 7 |
