Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds
About
Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding. Beyond the coarse semantic class prediction and bounding box regression of traditional 3D object detection, 3D dense captioning aims to produce a finer, instance-level label: a natural language description of the visual appearance and spatial relations of each object of interest in the scene. To detect and describe objects in a scene, we follow the spirit of neural machine translation and propose SpaCap3D, a transformer-based encoder-decoder architecture that transforms objects into descriptions. We specifically investigate the relative spatiality of objects in 3D scenes and design a spatiality-guided encoder via a token-to-token spatial relation learning objective, together with an object-centric decoder for precise and spatiality-enhanced object caption generation. Evaluated on two benchmark datasets, ScanRefer and ReferIt3D, SpaCap3D outperforms the baseline method Scan2Cap by 4.94% and 9.61% in CIDEr@0.5IoU, respectively. Our project page with source code and supplementary files is available at https://SpaCap3D.github.io/.
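The CIDEr@0.5IoU figures quoted above combine a caption score with a localization check. As a rough illustration only, the sketch below assumes the common convention that an object's caption score counts only when its predicted box overlaps the ground-truth box with IoU at or above the threshold, and that the result is averaged over all scored objects. The function names `box_iou_3d` and `metric_at_iou`, the axis-aligned box layout, and the `>=` comparison are assumptions made for this example; they are not taken from the SpaCap3D code, and the per-object caption scores are assumed to be computed elsewhere.

```python
from typing import Sequence, Tuple

# Hypothetical box layout: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned 3D box."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    union = box_volume(a) + box_volume(b) - inter
    return inter / union


def metric_at_iou(
    scores: Sequence[float],
    pred_boxes: Sequence[Box],
    gt_boxes: Sequence[Box],
    threshold: float = 0.5,
) -> float:
    """Average caption score, zeroing objects whose box IoU is below threshold.

    Assumes scores[i], pred_boxes[i] and gt_boxes[i] refer to the same object.
    """
    if not scores:
        return 0.0
    total = 0.0
    for score, pred, gt in zip(scores, pred_boxes, gt_boxes):
        if box_iou_3d(pred, gt) >= threshold:
            total += score
    return total / len(scores)
```

For example, with two objects scored 0.8 and 0.6 where only the first predicted box passes the IoU check, `metric_at_iou` returns 0.4: the second caption contributes nothing, however good it is, because its box is poorly localized.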
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| 3D Dense Captioning | ScanRefer (val) | CIDEr | 63.3 | 91 |
| 3D Dense Captioning | Scan2Cap (val) | CIDEr (@0.5) | 0.44 | 33 |
| 3D Dense Captioning | ScanRefer (test) | CIDEr | 63.3 | 30 |
| 3D Dense Captioning | Nr3D 1 (val) | CIDEr (IoU=0.5) | 33.71 | 22 |
| 3D Dense Captioning | ScanRefer | CIDEr@0.5IoU | 44.02 | 16 |
| 3D Dense Captioning | ReferIt3D Nr3D (test) | C Score (0.5 IoU) | 33.71 | 13 |
| 3D Dense Captioning | Nr3D (test) | C Score @ 0.5 IoU | 33.71 | 13 |
| 3D Dense Captioning | Nr3D | C Score (0.5 IoU) | 33.71 | 9 |
| 3D Dense Captioning | Nr3D 1 (test) | CIDEr | 33.71 | 7 |
| 3D Dense Captioning | ReferIt3D Nr3D (val) | C Score @0.5IoU | 33.71 | 5 |