Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding
About
3D Visual Grounding (3DVG) aims to localize 3D objects based on textual descriptions. Conventional supervised methods for 3DVG often require extensive annotations and a predefined vocabulary, which can be restrictive. To address this issue, we propose a novel visual programming approach for zero-shot open-vocabulary 3DVG, leveraging the capabilities of large language models (LLMs). Our approach begins with a unique dialog-based method, engaging with LLMs to establish a foundational understanding of zero-shot 3DVG. Building on this, we design a visual program that consists of three types of modules, i.e., view-independent, view-dependent, and functional modules. These modules, specifically tailored for 3D scenarios, work collaboratively to perform complex reasoning and inference. Furthermore, we develop an innovative language-object correlation module to extend the scope of existing 3D object detectors to open-vocabulary scenarios. Extensive experiments demonstrate that our zero-shot approach can outperform some supervised baselines, marking a significant stride towards effective 3DVG.
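To make the module taxonomy concrete, here is a minimal, hypothetical sketch of how such a visual program might compose view-independent, view-dependent, and functional modules to resolve a query like "the chair closest to the window". The module names (`LOC`, `CLOSEST`, `LEFT`), the box representation, and the assumed viewing direction are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of a visual program for zero-shot 3DVG.
# Module names, box format, and view direction are assumptions for illustration.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box3D:
    label: str
    center: Tuple[float, float, float]  # (x, y, z) in scene coordinates


def LOC(scene: List[Box3D], category: str) -> List[Box3D]:
    """Open-vocabulary localization: return candidate boxes of a category."""
    return [b for b in scene if b.label == category]


def CLOSEST(targets: List[Box3D], anchors: List[Box3D]) -> Box3D:
    """Functional, view-independent module: target nearest to any anchor."""
    def dist(a: Box3D, b: Box3D) -> float:
        return sum((p - q) ** 2 for p, q in zip(a.center, b.center)) ** 0.5
    return min(targets, key=lambda t: min(dist(t, a) for a in anchors))


def LEFT(targets: List[Box3D], anchor: Box3D,
         view_dir: Tuple[float, float, float] = (0.0, 1.0, 0.0)) -> List[Box3D]:
    """View-dependent module: targets to the left of an anchor,
    given an assumed viewing direction in the xy-plane."""
    vx, vy, _ = view_dir
    left = (-vy, vx)  # 90-degree counter-clockwise rotation of the view direction
    out = []
    for t in targets:
        dx = t.center[0] - anchor.center[0]
        dy = t.center[1] - anchor.center[1]
        if dx * left[0] + dy * left[1] > 0:
            out.append(t)
    return out


# Program for "the chair closest to the window":
scene = [
    Box3D("chair", (0.0, 0.0, 0.0)),
    Box3D("chair", (4.0, 0.0, 0.0)),
    Box3D("window", (5.0, 0.0, 1.0)),
]
result = CLOSEST(LOC(scene, "chair"), LOC(scene, "window"))
print(result.center)  # the chair at (4.0, 0.0, 0.0)
```

In the full approach, the LLM would emit such a program from the free-form description, and the language-object correlation module would supply the open-vocabulary candidates that `LOC` stands in for here.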
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Visual Grounding | ScanRefer (val) | Overall Accuracy @ IoU 0.5 | 32.7 | 155 |
| 3D Visual Grounding | Nr3D (test) | Overall Success Rate | 39 | 88 |
| 3D Visual Grounding | Nr3D | Overall Success Rate | 39 | 74 |
| Visual Grounding | ScanRefer v1 (val) | Acc@0.5 (All) | 32.7 | 30 |
| 3D Visual Grounding | ScanRefer Unique | Acc@0.25 (IoU=0.25) | 63.8 | 24 |
| 3D Visual Grounding | ScanRefer Multiple (val) | Accuracy @ IoU 0.25 | 27.7 | 15 |
| 3D Visual Grounding | Nr3D (val) | Easy Score | 46.5 | 13 |
| 3D Visual Grounding | ScanRefer 250 scenes (test) | Acc@0.25 (Unique) | 55.3 | 7 |