VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding

About

3D visual grounding is crucial for robots, requiring integration of natural language and 3D scene understanding. Traditional methods depending on supervised learning with 3D point clouds are limited by scarce datasets. Recently zero-shot methods leveraging LLMs have been proposed to address the data issue. While effective, these methods only use object-centric information, limiting their ability to handle complex queries. In this work, we present VLM-Grounder, a novel framework using vision-language models (VLMs) for zero-shot 3D visual grounding based solely on 2D images. VLM-Grounder dynamically stitches image sequences, employs a grounding and feedback scheme to find the target object, and uses a multi-view ensemble projection to accurately estimate 3D bounding boxes. Experiments on ScanRefer and Nr3D datasets show VLM-Grounder outperforms previous zero-shot methods, achieving 51.6% Acc@0.25 on ScanRefer and 48.0% Acc on Nr3D, without relying on 3D geometry or object priors. Codes are available at https://github.com/OpenRobotLab/VLM-Grounder .

Runsen Xu, Zhiwei Huang, Tai Wang, Yilun Chen, Jiangmiao Pang, Dahua Lin• 2024

Related benchmarks

Task	Dataset	Result
3D Visual Grounding	ScanRefer (val)	Overall Accuracy @ IoU 0.5032.8	253
3D Visual Grounding	ScanRefer	Acc@0.529.8	142
3D Visual Grounding	Nr3D	Overall Success Rate48	97
3D Visual Grounding	Nr3D (test)	Overall Success Rate48	88
3D Visual Grounding	ScanRefer Overall	Acc @ 0.2551.6	41
3D Visual Grounding	ScanRefer Unique	Acc@0.25 (IoU=0.25)66	41
3D Visual Grounding	Nr3D (val)	Easy Score55.2	20
3D Visual Grounding	ScanRefer Multiple	Accuracy @ IoU=0.2548.3	17
3D Visual Grounding	Nr3D without GT object class	Easy Success55.2	13
3D Visual Grounding	Nr3D Overall	Accuracy (Nr3D Overall)48	12

Showing 10 of 17 rows

Other info

Follow for update

@wizwand_team Discord