
EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding

About

3D visual grounding aims to locate the object in a point cloud referred to by a free-form natural language description with rich semantic cues. However, existing methods either extract sentence-level features that couple all words together, or focus mainly on object names, thereby losing word-level information or neglecting other attributes. To alleviate these issues, we present EDA, which Explicitly Decouples the textual attributes in a sentence and conducts Dense Alignment between such fine-grained language and point cloud objects. Specifically, we first propose a text decoupling module that produces textual features for each semantic component. We then design two losses to supervise the dense matching between the two modalities: a position alignment loss and a semantic alignment loss. On top of that, we further introduce a new visual grounding task, locating objects without mentioning their names, which thoroughly evaluates a model's dense alignment capacity. In experiments, we achieve state-of-the-art performance on two widely adopted 3D visual grounding datasets, ScanRefer and SR3D/NR3D, and rank first by a clear margin on our newly proposed task. The source code is available at https://github.com/yanmin-wu/EDA.
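To make the dense-alignment idea concrete, below is a minimal sketch (not the authors' code) of a contrastive alignment loss between candidate object features from a 3D detector and the features of decoupled text components. The function name, tensor shapes, plain InfoNCE formulation, and the temperature value are illustrative assumptions; the paper's actual position and semantic alignment losses differ in detail.

```python
# Hypothetical sketch of dense visual-text alignment, assuming:
# - object features come from a 3D detection backbone,
# - text features are one vector per decoupled semantic component
#   (e.g. object name, attribute, spatial relation),
# - a binary ground-truth matrix marks which object matches which component.
import torch
import torch.nn.functional as F

def dense_alignment_loss(obj_feats, text_feats, match, tau=0.07):
    """
    obj_feats:  (N, D) candidate object features from the point cloud.
    text_feats: (C, D) features of decoupled text components.
    match:      (N, C) binary matrix; match[i, j] = 1 if object i
                corresponds to text component j.
    """
    obj = F.normalize(obj_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = obj @ txt.t() / tau  # (N, C) cosine similarities, temperature-scaled

    # Contrast each object against all text components and vice versa,
    # pulling matched pairs together and pushing unmatched pairs apart.
    log_p_obj = F.log_softmax(logits, dim=1)  # object -> components
    log_p_txt = F.log_softmax(logits, dim=0)  # component -> objects
    pos = match.float()
    denom = pos.sum().clamp(min=1.0)          # avoid division by zero
    loss_o2t = -(pos * log_p_obj).sum() / denom
    loss_t2o = -(pos * log_p_txt).sum() / denom
    return 0.5 * (loss_o2t + loss_t2o)

# Toy usage with random features and a sparse random match matrix:
loss = dense_alignment_loss(torch.randn(32, 256), torch.randn(5, 256),
                            (torch.rand(32, 5) > 0.9).long())
```

Supervising the match at the component level, rather than against a single pooled sentence embedding, is what lets such a model still ground objects when the object name is withheld, since attribute and relation components carry their own alignment signal.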

Yanmin Wu, Xinhua Cheng, Renrui Zhang, Zesen Cheng, Jian Zhang • 2022

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| 3D Visual Grounding | ScanRefer (val) | Overall Accuracy @ IoU 0.50 | 42.3 | 155 |
| 3D Visual Grounding | Nr3D (test) | -- | -- | 88 |
| 3D Visual Grounding | Sr3D (test) | Overall Accuracy | 68.1 | 73 |
| 3D Visual Grounding | ScanRefer Unique | Accuracy @ IoU 0.25 | 85.8 | 24 |
| 3D Visual Grounding | ScanRefer Multiple (val) | Accuracy @ IoU 0.25 | 49.1 | 15 |
| Referring Expression Segmentation | ScanRefer | mIoU | 36.2 | 9 |
| Referring Expression Segmentation | ReferIt3D Nr3D | mIoU | 29.3 | 7 |
| Referring Expression Segmentation | MultiRefer3D | mIoU | 28.9 | 5 |
