Unifying 3D Vision-Language Understanding via Promptable Queries
About
A unified model for 3D vision-language (3D-VL) understanding is expected to take various scene representations and perform a wide range of tasks in a 3D scene. However, a considerable gap exists between existing methods and such a unified model, due to the independent application of scene representations and insufficient exploration of 3D multi-task training. In this paper, we introduce PQ3D, a unified model capable of using Promptable Queries to tackle a wide range of 3D-VL tasks, from low-level instance segmentation to high-level reasoning and planning. This is achieved through three key innovations: (1) unifying various 3D scene representations (i.e., voxels, point clouds, multi-view images) into a shared 3D coordinate space by segment-level grouping, (2) an attention-based query decoder for task-specific information retrieval guided by prompts, and (3) universal output heads for different tasks to support multi-task training. Tested across ten diverse 3D-VL datasets, PQ3D demonstrates impressive performance on these tasks, setting new records on most benchmarks. In particular, PQ3D improves the state of the art on ScanNet200 by 4.9% (AP25), ScanRefer by 5.4% (acc@0.5), Multi3DRefer by 11.7% (F1@0.5), and Scan2Cap by 13.4% (CIDEr@0.5). Moreover, PQ3D supports flexible inference with individual or combined forms of the available 3D representations, e.g., voxel input alone.
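The decoding flow described above — learnable queries that attend to a task prompt and to fused scene features before feeding universal output heads — can be sketched with plain NumPy. This is a minimal illustration of the attention pattern only, not the actual PQ3D implementation; all dimensions, variable names, and the single grounding head are illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys, values):
    # Scaled dot-product attention: each query gathers a weighted
    # combination of the value vectors based on query-key similarity.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values


rng = np.random.default_rng(0)
d = 32  # illustrative feature dimension

# Illustrative inputs (stand-ins for learned/encoded features):
instance_queries = rng.normal(size=(8, d))   # one query per candidate instance
scene_feats = rng.normal(size=(50, d))       # segment-level features fused
                                             # from voxels/points/images
prompt_feats = rng.normal(size=(6, d))       # encoded task prompt tokens

# One decoder layer: queries first attend to the prompt (what to look
# for), then to the scene (where it is), with residual connections.
q = instance_queries + cross_attention(instance_queries, prompt_feats, prompt_feats)
q = q + cross_attention(q, scene_feats, scene_feats)

# A universal head would map refined queries to a task output; here a
# single linear grounding head scores each query against the prompt.
w_ground = rng.normal(size=(d,))
grounding_scores = q @ w_ground
best_instance = int(np.argmax(grounding_scores))
print(q.shape, best_instance)
```

Because every task reads from the same refined queries, swapping the output head (mask head, grounding head, caption head) is what lets one model cover segmentation, grounding, and captioning.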
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Visual Grounding | Nr3D (test) | Overall Success Rate | 66.7 | 88 |
| 3D Visual Grounding | Sr3D (test) | Overall Accuracy | 79.7 | 73 |
| 3D Question Answering | ScanQA w/ objects (test) | EM@1 | 26.1 | 55 |
| 3D Question Answering | SQA3D (test) | EM@1 | 47.1 | 55 |
| Instance Segmentation | ScanNet200 (val) | mAP@50 | 38.9 | 53 |
| 3D Question Answering | ScanQA w/o objects (test) | EM@1 | 20 | 51 |
| 3D Situated Question Answering | SQA3D (test) | Average Accuracy | 47.1 | 40 |
| Visual Grounding | ScanRefer v1 (val) | -- | -- | 30 |
| 3D Dense Captioning | Scan2Cap | BLEU-4@0.5 | 36 | 23 |
| 3D Visual Grounding | ScanRefer (test) | Unique Accuracy | 86.7 | 21 |