3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection

About

3D visual grounding aims to locate the referred target object in 3D point cloud scenes according to a free-form language description. Previous methods mostly follow a two-stage paradigm, i.e., language-irrelevant detection and cross-modal matching, which is limited by the isolated architecture. In such a paradigm, the detector needs to sample keypoints from raw point clouds due to the inherent properties of 3D point clouds (irregular and large-scale), to generate the corresponding object proposal for each keypoint. However, sparse proposals may leave out the target in detection, while dense proposals may confuse the matching model. Moreover, the language-irrelevant detection stage can only sample a small proportion of keypoints on the target, deteriorating the target prediction. In this paper, we propose a 3D Single-Stage Referred Point Progressive Selection (3D-SPS) method, which progressively selects keypoints with the guidance of language and directly locates the target. Specifically, we propose a Description-aware Keypoint Sampling (DKS) module to coarsely focus on the points of language-relevant objects, which are significant clues for grounding. Besides, we devise a Target-oriented Progressive Mining (TPM) module to finely concentrate on the points of the target, which is enabled by progressive intra-modal relation modeling and inter-modal target mining. 3D-SPS bridges the gap between detection and matching in the 3D visual grounding task, localizing the target at a single stage. Experiments demonstrate that 3D-SPS achieves state-of-the-art performance on both ScanRefer and Nr3D/Sr3D datasets.
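The idea of progressively narrowing a keypoint set under language guidance can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`cosine`, `select_top_k`, `progressive_selection`), the cosine-similarity scoring, the stage `schedule`, and the optional `refine` hook are all assumptions made for illustration. The abstract describes relation modeling between selection steps, which is represented here only by the hypothetical `refine` callback.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_top_k(point_feats, text_feat, candidates, k):
    """Keep the k candidate indices whose features best match the text feature."""
    ranked = sorted(
        candidates,
        key=lambda i: cosine(point_feats[i], text_feat),
        reverse=True,
    )
    return ranked[:k]


def progressive_selection(point_feats, text_feat, schedule, refine=None):
    """Shrink the keypoint set stage by stage, e.g. schedule=[3, 2, 1].

    `refine` is a hypothetical hook standing in for per-stage feature
    updates; with refine=None the features stay fixed across stages.
    """
    kept = list(range(len(point_feats)))
    for k in schedule:
        if refine is not None:
            point_feats = refine(point_feats, text_feat, kept)
        kept = select_top_k(point_feats, text_feat, kept, k)
    return kept
```

With fixed features the staged loop reduces to a single top-k selection; the staging only matters once features are updated between stages, which is why the hook is exposed.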

Junyu Luo, Jiahui Fu, Xianghao Kong, Chen Gao, Haibing Ren, Hao Shen, Huaxia Xia, Si Liu • 2022

Related benchmarks

Task	Dataset	Metric	Result
3D Visual Grounding	ScanRefer (val)	Overall Accuracy @ IoU 0.50	37	253
3D Visual Grounding	ScanRefer	--	--	142
3D Visual Grounding	Nr3D	Overall Success Rate	51.5	97
3D Visual Grounding	Nr3D (test)	Overall Success Rate	51.5	88
3D Visual Grounding	Sr3D (test)	Overall Accuracy	62.6	73
3D referring expression comprehension	ScanRefer	Overall@0.25 Accuracy	48.82	21
3D Point Cloud Affordance Prediction	LASO-C 1.0 (Seen)	aIoU	11.4	21
3D Visual Grounding	ScanRefer (test)	--	--	21
3D Visual Grounding	Sr3D	Overall Accuracy	62.6	15
3D Affordance Learning	LASO (Unseen)	aIoU	0.079	13

Showing 10 of 19 rows
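For readers unfamiliar with the accuracy-at-IoU metrics in the table, the following is a minimal sketch of how such a score can be computed. It assumes axis-aligned boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)` and counts a prediction as correct when its IoU with the ground-truth box reaches the threshold. The function names and the boundary handling (`>=`) are assumptions, not taken from any benchmark's official evaluation code.

```python
def box_iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for d in range(3):
        lo = max(a[d], b[d])
        hi = min(a[d + 3], b[d + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def accuracy_at_iou(preds, gts, threshold):
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(box_iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)
```

A prediction that overlaps its target with IoU 1/3 would count as correct at threshold 0.25 but not at 0.5, which is why the stricter threshold yields lower scores.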
