Unifying 3D Vision-Language Understanding via Promptable Queries
About
A unified model for 3D vision-language (3D-VL) understanding is expected to take various scene representations and perform a wide range of tasks in a 3D scene. However, a considerable gap exists between existing methods and such a unified model, due to the independent application of scene representations and insufficient exploration of 3D multi-task training. In this paper, we introduce PQ3D, a unified model capable of using Promptable Queries to tackle a wide range of 3D-VL tasks, from low-level instance segmentation to high-level reasoning and planning. This is achieved through three key innovations: (1) unifying various 3D scene representations (i.e., voxels, point clouds, multi-view images) into a shared 3D coordinate space by segment-level grouping, (2) an attention-based query decoder for task-specific information retrieval guided by prompts, and (3) universal output heads for different tasks to support multi-task training. Tested across ten diverse 3D-VL datasets, PQ3D demonstrates impressive performance on these tasks, setting new records on most benchmarks. In particular, PQ3D improves the state of the art on ScanNet200 by 4.9% (AP25), ScanRefer by 5.4% (acc@0.5), Multi3DRefer by 11.7% (F1@0.5), and Scan2Cap by 13.4% (CIDEr@0.5). Moreover, PQ3D supports flexible inference with individual or combined forms of the available 3D representations, e.g., voxel input alone.
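The decoding flow described above — learnable queries that attend to a task prompt and to fused scene features before feeding universal output heads — can be sketched with plain NumPy. This is a minimal illustration of the attention pattern only, not the actual PQ3D implementation; all dimensions, variable names, and the single grounding head are illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys, values):
    # Scaled dot-product attention: each query gathers a weighted
    # combination of the value vectors based on query-key similarity.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values


rng = np.random.default_rng(0)
d = 32  # illustrative feature dimension

# Illustrative inputs (stand-ins for learned/encoded features):
instance_queries = rng.normal(size=(8, d))   # one query per candidate instance
scene_feats = rng.normal(size=(50, d))       # segment-level features fused
                                             # from voxels/points/images
prompt_feats = rng.normal(size=(6, d))       # encoded task prompt tokens

# One decoder layer: queries first attend to the prompt (what to look
# for), then to the scene (where it is), with residual connections.
q = instance_queries + cross_attention(instance_queries, prompt_feats, prompt_feats)
q = q + cross_attention(q, scene_feats, scene_feats)

# A universal head would map refined queries to a task output; here a
# single linear grounding head scores each query against the prompt.
w_ground = rng.normal(size=(d,))
grounding_scores = q @ w_ground
best_instance = int(np.argmax(grounding_scores))
print(q.shape, best_instance)
```

Because every task reads from the same refined queries, swapping the output head (mask head, grounding head, caption head) is what lets one model cover segmentation, grounding, and captioning.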
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Visual Grounding | Nr3D (test) | Overall Success Rate | 66.7 | 88 |
| 3D Visual Grounding | Sr3D (test) | Overall Accuracy | 79.7 | 73 |
| 3D Question Answering | ScanQA w/ objects (test) | EM@1 | 26.1 | 55 |
| 3D Question Answering | SQA3D (test) | EM@1 | 47.1 | 55 |
| Instance Segmentation | ScanNet200 (val) | mAP@50 | 38.9 | 53 |
| 3D Question Answering | ScanQA w/o objects (test) | EM@1 | 20 | 51 |
| 3D Situated Question Answering | SQA3D (test) | Average Accuracy | 47.1 | 40 |
| Visual Grounding | ScanRefer v1 (val) | -- | -- | 30 |
| 3D Dense Captioning | Scan2Cap | BLEU-4@0.5 | 36 | 23 |
| 3D Visual Grounding | ScanRefer (test) | Unique Accuracy | 86.7 | 21 |