EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding
About
3D visual grounding aims to locate the object in a point cloud that is mentioned by a free-form natural language description with rich semantic cues. However, existing methods either extract sentence-level features that couple all words together, losing word-level information, or focus mainly on object names, neglecting other attributes. To alleviate these issues, we present EDA, which Explicitly Decouples the textual attributes in a sentence and conducts Dense Alignment between such fine-grained language and point cloud objects. Specifically, we first propose a text decoupling module that produces textual features for every semantic component. We then design two losses to supervise the dense matching between the two modalities: a position alignment loss and a semantic alignment loss. On top of that, we introduce a new visual grounding task, locating objects without mentioning object names, which thoroughly evaluates a model's dense alignment capacity. In experiments, we achieve state-of-the-art performance on two widely adopted 3D visual grounding datasets, ScanRefer and SR3D/NR3D, and lead by a clear margin on our newly proposed task. The source code is available at https://github.com/yanmin-wu/EDA.
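To make the two supervision signals concrete, here is a minimal, hypothetical sketch of what a position alignment loss (decoupled text components should select the ground-truth object box) and a symmetric contrastive semantic alignment loss could look like. All function names, tensor shapes, and the temperature value are illustrative assumptions, not the repository's actual implementation; see the source code linked above for the real losses.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def position_alignment_loss(obj_feats, text_feats, target_idx):
    """Illustrative position alignment: every decoupled text component
    should assign high probability to the ground-truth object.
    obj_feats:  (N_obj, D) candidate-object features
    text_feats: (N_comp, D) decoupled text-component features
    target_idx: index of the ground-truth object
    """
    sim = text_feats @ obj_feats.T                   # (N_comp, N_obj)
    probs = softmax(sim, axis=-1)
    return float(-np.log(probs[:, target_idx] + 1e-9).mean())

def semantic_alignment_loss(obj_feats, text_feats, tau=0.07):
    """Illustrative symmetric InfoNCE-style contrastive loss; assumes
    the i-th text component is matched with the i-th object."""
    obj = obj_feats / np.linalg.norm(obj_feats, axis=-1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    logits = txt @ obj.T / tau                       # (N, N)
    diag = np.arange(logits.shape[0])
    p_t2o = softmax(logits, axis=-1)                 # text -> object
    p_o2t = softmax(logits.T, axis=-1)               # object -> text
    return float(-0.5 * (np.log(p_t2o[diag, diag]).mean()
                         + np.log(p_o2t[diag, diag]).mean()))
```

In this reading, the position loss densely supervises every decoupled component (appearance, relation, etc.), not just the object name, which is what enables grounding even when the name is withheld in the new task.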
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Visual Grounding | ScanRefer (val) | Overall Accuracy @ IoU 0.5 | 42.3 | 155 |
| 3D Visual Grounding | Nr3D (test) | -- | -- | 88 |
| 3D Visual Grounding | Sr3D (test) | Overall Accuracy | 68.1 | 73 |
| 3D Visual Grounding | ScanRefer Unique | Acc @ IoU 0.25 | 85.8 | 24 |
| 3D Visual Grounding | ScanRefer Multiple (val) | Accuracy @ IoU 0.25 | 49.1 | 15 |
| Referring Expression Segmentation | ScanRefer | mIoU | 36.2 | 9 |
| Referring Expression Segmentation | ReferIt3D Nr3D | mIoU | 29.3 | 7 |
| Referring Expression Segmentation | MultiRefer3D | mIoU | 28.9 | 5 |