
Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding

About

3D Visual Grounding (3DVG) aims to localize a 3D object based on a textual description. Conventional supervised methods for 3DVG often require extensive annotations and a predefined vocabulary, which can be restrictive. To address this issue, we propose a novel visual programming approach for zero-shot open-vocabulary 3DVG, leveraging the capabilities of large language models (LLMs). Our approach begins with a unique dialog-based method, engaging with LLMs to establish a foundational understanding of zero-shot 3DVG. Building on this, we design a visual program that consists of three types of modules, i.e., view-independent, view-dependent, and functional modules. These modules, specifically tailored for 3D scenarios, work collaboratively to perform complex reasoning and inference. Furthermore, we develop an innovative language-object correlation module to extend the scope of existing 3D object detectors to open-vocabulary scenarios. Extensive experiments demonstrate that our zero-shot approach can outperform some supervised baselines, marking a significant stride towards effective 3DVG.
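The modular pipeline described in the abstract can be sketched as follows. This is a minimal illustration of how view-independent, view-dependent, and functional modules might compose to ground a phrase like "the chair to the left of the table"; the module names, the fixed-viewpoint assumption, and the toy scene are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Obj:
    label: str
    center: tuple  # (x, y, z) in scene coordinates

# View-independent module: filter candidates by category,
# irrespective of the observer's viewpoint.
def filter_by_label(objs, label):
    return [o for o in objs if o.label == label]

# View-dependent module: resolve a spatial relation such as
# "left of" relative to an assumed viewpoint (here: smaller x
# means further left, a simplifying assumption).
def left_of(objs, anchor):
    return [o for o in objs if o.center[0] < anchor.center[0]]

# Functional module: pick the candidate closest to the anchor.
def closest(objs, anchor):
    def sq_dist(o):
        return sum((a - b) ** 2 for a, b in zip(o.center, anchor.center))
    return min(objs, key=sq_dist)

# Toy scene: ground "the chair to the left of the table".
scene = [
    Obj("chair", (0.0, 1.0, 0.0)),
    Obj("chair", (3.0, 1.0, 0.0)),
    Obj("table", (2.0, 1.0, 0.0)),
]
table = filter_by_label(scene, "table")[0]
candidates = left_of(filter_by_label(scene, "chair"), table)
target = closest(candidates, table)
print(target.center)  # → (0.0, 1.0, 0.0)
```

In the paper's setting, an LLM would emit the sequence of module calls from the textual query, and the candidate objects would come from the open-vocabulary 3D detector rather than a hand-written list.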

Zhihao Yuan, Jinke Ren, Chun-Mei Feng, Hengshuang Zhao, Shuguang Cui, Zhen Li • 2023

Related benchmarks

Task                | Dataset                     | Metric                       | Result | Rank
3D Visual Grounding | ScanRefer (val)             | Overall Accuracy @ IoU 0.50  | 32.7   | 155
3D Visual Grounding | Nr3D (test)                 | Overall Success Rate         | 39     | 88
3D Visual Grounding | Nr3D                        | Overall Success Rate         | 39     | 74
Visual Grounding    | ScanRefer v1 (val)          | Acc@0.5 (All)                | 32.7   | 30
3D Visual Grounding | ScanRefer Unique            | Acc@0.25 (IoU=0.25)          | 63.8   | 24
3D Visual Grounding | ScanRefer Multiple (val)    | Accuracy @ IoU 0.25          | 27.7   | 15
3D Visual Grounding | Nr3D (val)                  | Easy Score                   | 46.5   | 13
3D Visual Grounding | ScanRefer 250 scenes (test) | Acc@0.25 (Unique)            | 55.3   | 7
