Exploring Data-Efficient 3D Scene Understanding with Contrastive Scene Contexts
About
The rapid progress in 3D scene understanding has come with a growing demand for data; however, collecting and annotating 3D scenes (e.g. point clouds) is notoriously hard. For example, the number of scenes (e.g. indoor rooms) that can be accessed and scanned may be limited; even given sufficient data, acquiring 3D labels (e.g. instance masks) requires intensive human labor. In this paper, we explore data-efficient learning for 3D point clouds. As a first step in this direction, we propose Contrastive Scene Contexts, a 3D pre-training method that makes use of both point-level correspondences and spatial contexts in a scene. Our method achieves state-of-the-art results on a suite of benchmarks where training data or labels are scarce. Our study reveals that exhaustive labeling of 3D point clouds might be unnecessary; remarkably, on ScanNet, even using 0.1% of point labels, we still achieve 89% (instance segmentation) and 96% (semantic segmentation) of the baseline performance obtained with full annotations.
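The core idea of combining point-level correspondences with spatial contexts can be illustrated with a minimal sketch: matched point pairs from two views are partitioned by their spatial context (here, assumed to be angle/distance bins relative to the scene center), and a contrastive (InfoNCE-style) loss is computed within each partition. This is an illustrative simplification, not the paper's actual implementation; function names, the binning scheme, and hyperparameters (`temperature`, `n_angle`, `n_dist`) are hypothetical.

```python
import numpy as np

def info_nce(anchor, positive, temperature=0.1):
    """InfoNCE over matched feature pairs: each anchor's positive is its
    matched point; all other matched points serve as negatives."""
    # Normalize features so the dot product is a cosine similarity.
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # matched pairs lie on the diagonal

def scene_context_loss(feat_a, feat_b, xyz, n_angle=2, n_dist=2):
    """Partition matched pairs by the spatial context of the anchor point
    (angle and distance bins relative to the scene center), compute a
    contrastive loss within each partition, and average the losses."""
    rel = xyz - xyz.mean(axis=0)                      # positions relative to scene center
    angle = np.arctan2(rel[:, 1], rel[:, 0])          # in [-pi, pi]
    dist = np.linalg.norm(rel[:, :2], axis=1)
    angle_bin = np.minimum((angle + np.pi) / (2 * np.pi) * n_angle,
                           n_angle - 1).astype(int)
    dist_bin = np.minimum(dist / (dist.max() + 1e-8) * n_dist,
                          n_dist - 1).astype(int)
    partition = angle_bin * n_dist + dist_bin
    losses = []
    for k in np.unique(partition):
        idx = np.where(partition == k)[0]
        if len(idx) < 2:                              # need at least one negative
            continue
        losses.append(info_nce(feat_a[idx], feat_b[idx]))
    return float(np.mean(losses))
```

The partitioning keeps negatives spatially diverse: instead of contrasting each point against all others in the scene, the loss is applied separately within each spatial region, which is one plausible way to make use of scene context during pre-training.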
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | S3DIS (Area 5) | mIoU: 72.2 | 799 |
| 3D Object Detection | ScanNet V2 (val) | -- | 352 |
| Semantic segmentation | ScanNet V2 (val) | mIoU: 73.8 | 288 |
| Semantic segmentation | ScanNet v2 (test) | mIoU: 73.8 | 248 |
| Semantic segmentation | ScanNet (val) | mIoU: 73.8 | 231 |
| 3D Instance Segmentation | ScanNet V2 (val) | Average AP50: 59.4 | 195 |
| 3D Semantic Segmentation | ScanNet V2 (val) | mIoU: 73.8 | 171 |
| 3D Object Detection | SUN RGB-D (val) | -- | 158 |
| 3D Visual Grounding | ScanRefer (val) | -- | 155 |
| 3D Object Detection | ScanNet | -- | 123 |