SceneGraphGrounder: Zero-Shot 3D Visual Grounding via Structured Scene Graph Matching

About

Zero-shot 3D visual grounding requires localizing objects in unstructured environments from free-form natural language. Recent vision-language model (VLM) approaches achieve promising results but rely on view-dependent reasoning or implicit representations, limiting spatial consistency and interpretability for compositional queries. We propose SceneGraphGrounder, a framework that reformulates 3D grounding as structured graph matching over a reconstructed 3D scene graph. To enable this formulation, we introduce a visual marker prompting strategy that enables a VLM to infer object-object relationships from 2D views, which are subsequently lifted into a persistent 3D scene graph encoding both spatial and semantic relations. Given a query, we construct a query graph and perform constrained alignment with the scene graph, ensuring multi-view consistency and interpretable reasoning. Experiments on the ScanRefer benchmark demonstrate that our method achieves competitive performance among zero-shot approaches, using only RGB-D inputs. We further validate our framework through real-world deployment on a mobile robot, demonstrating robust spatial reasoning in long-horizon physical environments. We will make our code publicly available upon acceptance.

Xuefei Sun, Xujia Zhang, Brendan Crowe, Doncey Albin, Christoffer Heckman• 2026

Related benchmarks

Task	Dataset	Result	Rank
3D Visual Grounding	ScanRefer (val)	Overall Accuracy @ IoU 0.5026		262

Showing 1 of 1 rows

Other info

Follow for update

@wizwand_team Discord