ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language
About
We introduce the task of 3D object localization in RGB-D scans using natural language descriptions. As input, we assume a point cloud of a scanned 3D scene along with a free-form description of a specified target object. To address this task, we propose ScanRefer, which learns a fused descriptor from 3D object proposals and encoded sentence embeddings. This fused descriptor correlates language expressions with geometric features, enabling regression of the 3D bounding box of the target object. We also introduce the ScanRefer dataset, containing 51,583 descriptions of 11,046 objects from 800 ScanNet scenes. ScanRefer is the first large-scale effort to perform object localization via natural language expressions directly in 3D.
Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner • 2019
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Visual Grounding | ScanRefer (val) | Overall Accuracy @ IoU 0.50 | 43.31 | 155 |
| 3D Question Answering | ScanQA (val) | CIDEr | 64.9 | 133 |
| 3D Visual Grounding | Nr3D (test) | Overall Success Rate | 34.2 | 88 |
| 3D Visual Grounding | Nr3D | Overall Success Rate | 34.2 | 74 |
| 3D Question Answering | ScanQA w/ objects (test) | EM@1 | 20.56 | 55 |
| 3D Question Answering | ScanQA w/o objects (test) | EM@1 | 19.04 | 51 |
| Visual Grounding | ScanRefer v1 (val) | -- | -- | 30 |
| 3D Visual Grounding | ScanRefer Unique | Acc@0.25 (IoU=0.25) | 67.6 | 24 |
| 3D Visual Grounding | ScanRefer | Acc@0.25 | 37.3 | 23 |
| 3D Question Answering | ScanQA (test) | BLEU-4 | 7.5 | 20 |
Showing 10 of 26 rows
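
The grounding metrics in the table (Acc@0.25, Overall Accuracy @ IoU 0.50) count a prediction as correct when the predicted 3D bounding box overlaps the ground-truth box by at least a given intersection-over-union (IoU). The sketch below illustrates that computation under stated assumptions: boxes are axis-aligned and given as `(xmin, ymin, zmin, xmax, ymax, zmax)`, and a hit is counted when IoU is greater than or equal to the threshold. The function names `box_iou_3d` and `acc_at_iou` are hypothetical and are not taken from the ScanRefer codebase; the official evaluation scripts may differ in box format and tie handling.

```python
def box_volume(box):
    """Volume of an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(box_a, box_b):
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(box_a[axis], box_b[axis])          # start of the overlap on this axis
        hi = min(box_a[axis + 3], box_b[axis + 3])  # end of the overlap on this axis
        if hi <= lo:
            return 0.0  # no overlap on this axis means no intersection at all
        inter *= hi - lo
    union = box_volume(box_a) + box_volume(box_b) - inter
    return inter / union


def acc_at_iou(pred_boxes, gt_boxes, threshold):
    """Fraction of predictions whose IoU with the paired ground truth meets the threshold.

    Assumption: one predicted box per description, paired by position with gt_boxes.
    """
    hits = sum(
        1 for pred, gt in zip(pred_boxes, gt_boxes)
        if box_iou_3d(pred, gt) >= threshold
    )
    return hits / len(gt_boxes)


# Toy data (made up for illustration, not from the dataset).
preds = [(0, 0, 0, 2, 2, 2), (0, 0, 0, 1, 1, 1)]
gts = [(1, 0, 0, 3, 2, 2), (0, 0, 0, 1, 1, 1)]

print(acc_at_iou(preds, gts, 0.25))
print(acc_at_iou(preds, gts, 0.50))
```

In the toy data, the first pair overlaps with an IoU of 1/3, so it counts as correct at the 0.25 threshold but not at 0.50, which is why accuracy at the stricter threshold is typically lower for the same predictions.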