
PanoGrounder: Bridging 2D and 3D with Panoramic Scene Representations for VLM-based 3D Visual Grounding

About

3D Visual Grounding (3DVG) is a critical bridge from vision-language perception to robotics, requiring both language understanding and 3D scene reasoning. Traditional supervised models leverage explicit 3D geometry but exhibit limited generalization, owing to the scarcity of 3D vision-language datasets and their limited reasoning capabilities compared to modern vision-language models (VLMs). We propose PanoGrounder, a generalizable 3DVG framework that couples a multi-modal panoramic representation with pretrained 2D VLMs for strong vision-language reasoning. Panoramic renderings, augmented with 3D semantic and geometric features, serve as an intermediate representation between 2D and 3D and offer two major benefits: (i) they can be fed directly to VLMs with minimal adaptation, and (ii) they retain long-range object-to-object relations thanks to their 360-degree field of view. We devise a three-stage pipeline that places a compact set of panoramic viewpoints based on the scene layout and geometry, grounds a text query on each panoramic rendering with a VLM, and fuses the per-view predictions into a single 3D bounding box via lifting. Our approach achieves state-of-the-art results on ScanRefer and Nr3D, and demonstrates superior generalization to unseen 3D datasets and text rephrasings.
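The three-stage pipeline described above can be sketched in code. This is an illustrative toy sketch only: the function names, data structures, grid-based viewpoint placement, stubbed VLM call, and confidence-weighted fusion rule are all our assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ViewPrediction:
    viewpoint: Tuple[float, float, float]  # panorama center in scene coordinates
    box_3d: Tuple[float, ...]              # lifted 3D box: (x, y, z, w, h, d)
    confidence: float                      # VLM grounding score for this view

def place_viewpoints(scene_extent):
    """Stage 1: place a compact set of panoramic viewpoints.
    Toy version: a fixed 2x2 grid over the floor plan; the paper
    uses scene layout and geometry, which we do not model here."""
    (x0, y0), (x1, y1) = scene_extent
    xs = [x0 + (x1 - x0) * t for t in (0.25, 0.75)]
    ys = [y0 + (y1 - y0) * t for t in (0.25, 0.75)]
    return [(x, y, 1.6) for x in xs for y in ys]  # ~eye height in meters

def ground_in_view(viewpoint, query):
    """Stage 2: ground the text query on the panoramic rendering with a
    VLM, then lift the 2D answer to a 3D box. Stubbed with a fixed
    prediction in place of a real VLM call."""
    return ViewPrediction(viewpoint,
                          box_3d=(2.0, 3.0, 0.5, 1.0, 1.0, 1.0),
                          confidence=0.9)

def fuse(predictions: List[ViewPrediction]):
    """Stage 3: fuse per-view predictions into one 3D bounding box.
    Toy version: confidence-weighted average of the box parameters."""
    total = sum(p.confidence for p in predictions)
    return tuple(sum(p.box_3d[i] * p.confidence for p in predictions) / total
                 for i in range(6))

views = place_viewpoints(((0.0, 0.0), (8.0, 6.0)))
preds = [ground_in_view(v, "the chair next to the window") for v in views]
box = fuse(preds)
```

Since every stub prediction is identical here, the fused box simply reproduces the per-view box; in the real pipeline the views disagree, and the fusion step is what resolves them into a single 3D box.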

Seongmin Jung, Seongho Choi, Gunwoo Jeon, Minsu Cho, Jongwoo Lim • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
3D Visual Grounding | Nr3D (test) | Overall Success Rate | 76.1 | 88
3D Visual Grounding | Sr3D (test) | Overall Accuracy | 79.9 | 73
3D Visual Grounding | ScanRefer (test) | Unique Accuracy | 85.0 | 21
3D Visual Grounding | ARKitScenes (test) | Unique Success Rate | 74.2 | 5
3D Visual Grounding | ScanRefer ScanNet v2 (val) | Unique Accuracy | 91.7 | 5
3D Visual Grounding | 3RScan (test) | Unique Success Rate | 80.4 | 3
