3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection

About

3D visual grounding aims to locate the referred target object in 3D point cloud scenes according to a free-form language description. Previous methods mostly follow a two-stage paradigm, i.e., language-irrelevant detection and cross-modal matching, which is limited by the isolated architecture. In such a paradigm, the detector needs to sample keypoints from raw point clouds due to the inherent properties of 3D point clouds (irregular and large-scale), to generate the corresponding object proposal for each keypoint. However, sparse proposals may leave out the target in detection, while dense proposals may confuse the matching model. Moreover, the language-irrelevant detection stage can only sample a small proportion of keypoints on the target, deteriorating the target prediction. In this paper, we propose a 3D Single-Stage Referred Point Progressive Selection (3D-SPS) method, which progressively selects keypoints with the guidance of language and directly locates the target. Specifically, we propose a Description-aware Keypoint Sampling (DKS) module to coarsely focus on the points of language-relevant objects, which are significant clues for grounding. Besides, we devise a Target-oriented Progressive Mining (TPM) module to finely concentrate on the points of the target, which is enabled by progressive intra-modal relation modeling and inter-modal target mining. 3D-SPS bridges the gap between detection and matching in the 3D visual grounding task, localizing the target at a single stage. Experiments demonstrate that 3D-SPS achieves state-of-the-art performance on both ScanRefer and Nr3D/Sr3D datasets.
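The coarse-to-fine idea behind the progressive selection described above can be illustrated with a toy sketch. This is not the authors' implementation: in the paper, DKS and TPM are learned modules with relation modeling between stages, whereas here a fixed cosine similarity between hypothetical per-point features and a single text feature stands in for the language guidance. The function names and the `schedule` parameter are assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_top_k(point_feats, indices, text_feat, k):
    """Keep the k candidate points whose features best match the text feature."""
    ranked = sorted(
        indices,
        key=lambda i: cosine(point_feats[i], text_feat),
        reverse=True,
    )
    return ranked[:k]


def progressive_selection(point_feats, text_feat, schedule):
    """Shrink the keypoint set stage by stage, e.g. schedule=(3, 1).

    Hypothetical stand-in: a real model would refine the point features
    between stages; this toy reuses the same fixed score at every stage.
    """
    kept = list(range(len(point_feats)))
    for k in schedule:
        kept = select_top_k(point_feats, kept, text_feat, k)
    return kept


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
    text = [1.0, 0.0]
    print(progressive_selection(feats, text, schedule=(3, 1)))
```

Because the score is fixed here, the stages only narrow the same ranking; the sketch shows the shrinking candidate set, not the feature refinement that makes progressive mining useful in the actual method.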

Junyu Luo, Jiahui Fu, Xianghao Kong, Chen Gao, Haibing Ren, Hao Shen, Huaxia Xia, Si Liu • 2022

Related benchmarks

Task | Dataset | Result | Rank
3D Visual Grounding | ScanRefer (val) | Overall Accuracy @ IoU 0.50: 37 | 155
3D Visual Grounding | Nr3D (test) | Overall Success Rate: 51.5 | 88
3D Visual Grounding | Nr3D | Overall Success Rate: 51.5 | 74
3D Visual Grounding | Sr3D (test) | Overall Accuracy: 62.6 | 73
3D Visual Grounding | ScanRefer | -- | 23
3D Point Cloud Affordance Prediction | LASO-C 1.0 (Seen) | aIoU: 11.4 | 21
3D Visual Grounding | ScanRefer (test) | -- | 21
3D Object Grounding | ScanRefer detected proposals v1 (val) | Unique Acc@0.25: 81.63 | 10
3D Visual Grounding | Sr3D | Overall Accuracy: 62.6 | 7
3D Affordance Learning | LASO (Unseen) | aIoU: 0.079 | 7
Showing 10 of 13 rows
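
Metrics of the form "Accuracy @ IoU t" count a prediction as correct when its 3D box overlaps the ground-truth box with an intersection-over-union of at least t. The sketch below shows one way to compute this, assuming axis-aligned boxes given as a (min_corner, max_corner) pair and a greater-than-or-equal threshold test. The function names are hypothetical, and the official benchmark evaluation scripts may differ in details.

```python
def box_volume(box):
    """Volume of an axis-aligned box ((x0, y0, z0), (x1, y1, z1)); 0.0 if degenerate."""
    (x0, y0, z0), (x1, y1, z1) = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0) * max(0.0, z1 - z0)


def iou_3d(a, b):
    """Intersection-over-union of two axis-aligned 3D boxes."""
    lo = tuple(max(p, q) for p, q in zip(a[0], b[0]))  # intersection min corner
    hi = tuple(min(p, q) for p, q in zip(a[1], b[1]))  # intersection max corner
    inter = box_volume((lo, hi))
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds, gts, threshold=0.5):
    """Percentage of predictions whose IoU with the ground truth reaches the threshold."""
    hits = sum(iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


if __name__ == "__main__":
    a = ((0.0, 0.0, 0.0), (2.0, 2.0, 2.0))
    b = ((1.0, 0.0, 0.0), (3.0, 2.0, 2.0))
    print(iou_3d(a, b))
    print(accuracy_at_iou([a, a], [a, b], threshold=0.5))
```

In the usage example, the two boxes share a third of their combined volume, so the second prediction counts as a hit at a 0.25 threshold but not at 0.5.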
