
Unifying 3D Vision-Language Understanding via Promptable Queries

About

A unified model for 3D vision-language (3D-VL) understanding is expected to take various scene representations and perform a wide range of tasks in a 3D scene. However, a considerable gap exists between existing methods and such a unified model, due to the independent application of representation and insufficient exploration of 3D multi-task training. In this paper, we introduce PQ3D, a unified model capable of using Promptable Queries to tackle a wide range of 3D-VL tasks, from low-level instance segmentation to high-level reasoning and planning. This is achieved through three key innovations: (1) unifying various 3D scene representations (i.e., voxels, point clouds, multi-view images) into a shared 3D coordinate space by segment-level grouping, (2) an attention-based query decoder for task-specific information retrieval guided by prompts, and (3) universal output heads for different tasks to support multi-task training. Tested across ten diverse 3D-VL datasets, PQ3D demonstrates impressive performance on these tasks, setting new records on most benchmarks. Particularly, PQ3D improves the state-of-the-art on ScanNet200 by 4.9% (AP25), ScanRefer by 5.4% (acc@0.5), Multi3DRefer by 11.7% (F1@0.5), and Scan2Cap by 13.4% (CIDEr@0.5). Moreover, PQ3D supports flexible inference with individual or combined forms of available 3D representations, e.g., solely voxel input.
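Several of the figures above are reported at an IoU threshold, for example acc@0.5 on ScanRefer. The sketch below illustrates how such a thresholded accuracy is commonly computed. It assumes axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax) and treats a prediction as correct when its IoU with the ground-truth box is at least the threshold. The names iou_3d and acc_at_iou are hypothetical, and each benchmark's official evaluation script may differ in details such as box parameterization or strict versus non-strict comparison.

```python
from typing import Sequence, Tuple

# Assumed box layout: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box; degenerate boxes have volume 0."""
    dx = max(0.0, box[3] - box[0])
    dy = max(0.0, box[4] - box[1])
    dz = max(0.0, box[5] - box[2])
    return dx * dy * dz


def iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        low = max(a[axis], b[axis])
        high = min(a[axis + 3], b[axis + 3])
        if high <= low:
            return 0.0  # no overlap along this axis
        intersection *= high - low
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union if union > 0 else 0.0


def acc_at_iou(preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the paired ground truth meets the threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)
```

With this definition, a predicted box that overlaps only a third of the combined volume counts as a miss at the 0.5 threshold, which is why acc@0.5 is a stricter measure than acc@0.25.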

Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma, Xuesong Niu, Yixin Chen, Baoxiong Jia, Zhidong Deng, Siyuan Huang, Qing Li • 2024

Related benchmarks

Task                            Dataset                     Result                      Rank
3D Visual Grounding             Nr3D (test)                 Overall Success Rate 66.7   88
3D Visual Grounding             Sr3D (test)                 Overall Accuracy 79.7       73
3D Question Answering           ScanQA w/ objects (test)    EM@1 26.1                   55
3D Question Answering           SQA3D (test)                EM@1 47.1                   55
Instance Segmentation           ScanNet200 (val)            mAP@50 38.9                 53
3D Question Answering           ScanQA w/o objects (test)   EM@1 20                     51
3D Situated Question Answering  SQA3D (test)                Average Accuracy 47.1       40
Visual Grounding                ScanRefer v1 (val)          --                          30
3D Dense Captioning             Scan2Cap                    BLEU-4 @0.5 36              23
3D Visual Grounding             ScanRefer (test)            Unique Accuracy 86.7        21
Showing 10 of 17 rows
