Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding
About
3D Visual Grounding (3DVG) aims to localize 3D objects based on textual descriptions. Conventional supervised methods for 3DVG often require extensive annotations and a predefined vocabulary, which can be restrictive. To address this issue, we propose a novel visual programming approach for zero-shot open-vocabulary 3DVG, leveraging the capabilities of large language models (LLMs). Our approach begins with a unique dialog-based method, engaging with LLMs to establish a foundational understanding of zero-shot 3DVG. Building on this, we design a visual program that consists of three types of modules, i.e., view-independent, view-dependent, and functional modules. These modules, specifically tailored for 3D scenarios, work collaboratively to perform complex reasoning and inference. Furthermore, we develop an innovative language-object correlation module to extend the scope of existing 3D object detectors to open-vocabulary scenarios. Extensive experiments demonstrate that our zero-shot approach can outperform some supervised baselines, marking a significant stride towards effective 3DVG.
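To make the module taxonomy concrete, here is a minimal, hypothetical sketch of how such a visual program might compose view-independent, view-dependent, and functional modules to resolve a query like "the chair closest to the window". The module names (`LOC`, `CLOSEST`, `LEFT`), the box representation, and the assumed viewing direction are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of a visual program for zero-shot 3DVG.
# Module names, box format, and view direction are assumptions for illustration.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box3D:
    label: str
    center: Tuple[float, float, float]  # (x, y, z) in scene coordinates


def LOC(scene: List[Box3D], category: str) -> List[Box3D]:
    """Open-vocabulary localization: return candidate boxes of a category."""
    return [b for b in scene if b.label == category]


def CLOSEST(targets: List[Box3D], anchors: List[Box3D]) -> Box3D:
    """Functional, view-independent module: target nearest to any anchor."""
    def dist(a: Box3D, b: Box3D) -> float:
        return sum((p - q) ** 2 for p, q in zip(a.center, b.center)) ** 0.5
    return min(targets, key=lambda t: min(dist(t, a) for a in anchors))


def LEFT(targets: List[Box3D], anchor: Box3D,
         view_dir: Tuple[float, float, float] = (0.0, 1.0, 0.0)) -> List[Box3D]:
    """View-dependent module: targets to the left of an anchor,
    given an assumed viewing direction in the xy-plane."""
    vx, vy, _ = view_dir
    left = (-vy, vx)  # 90-degree counter-clockwise rotation of the view direction
    out = []
    for t in targets:
        dx = t.center[0] - anchor.center[0]
        dy = t.center[1] - anchor.center[1]
        if dx * left[0] + dy * left[1] > 0:
            out.append(t)
    return out


# Program for "the chair closest to the window":
scene = [
    Box3D("chair", (0.0, 0.0, 0.0)),
    Box3D("chair", (4.0, 0.0, 0.0)),
    Box3D("window", (5.0, 0.0, 1.0)),
]
result = CLOSEST(LOC(scene, "chair"), LOC(scene, "window"))
print(result.center)  # the chair at (4.0, 0.0, 0.0)
```

In the full approach, the LLM would emit such a program from the free-form description, and the language-object correlation module would supply the open-vocabulary candidates that `LOC` stands in for here.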
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Visual Grounding | ScanRefer (val) | Overall Accuracy @ IoU 0.5 | 32.7 | 155 |
| 3D Visual Grounding | Nr3D (test) | Overall Success Rate | 39 | 88 |
| 3D Visual Grounding | Nr3D | Overall Success Rate | 39 | 74 |
| Visual Grounding | ScanRefer v1 (val) | Acc@0.5 (All) | 32.7 | 30 |
| 3D Visual Grounding | ScanRefer Unique | Acc@0.25 (IoU=0.25) | 63.8 | 24 |
| 3D Visual Grounding | ScanRefer Multiple (val) | Accuracy @ IoU 0.25 | 27.7 | 15 |
| 3D Visual Grounding | Nr3D (val) | Easy Score | 46.5 | 13 |
| 3D Visual Grounding | ScanRefer 250 scenes (test) | Acc@0.25 (Unique) | 55.3 | 7 |