OpenScene: 3D Scene Understanding with Open Vocabularies
About
Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision. We propose OpenScene, an alternative approach in which a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space. This zero-shot approach enables task-agnostic training and open-vocabulary queries. For example, to perform state-of-the-art zero-shot 3D semantic segmentation, OpenScene first infers CLIP features for every 3D point and then classifies each point by its similarity to the embeddings of arbitrary class labels. More interestingly, it enables a suite of open-vocabulary scene understanding applications that were not previously possible. For example, it allows a user to enter an arbitrary text query and see a heat map indicating which parts of a scene match. Our approach is effective at identifying objects, materials, affordances, activities, and room types in complex 3D scenes, all using a single model trained without any labeled 3D data.
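The zero-shot classification step described above can be sketched as follows. This is a minimal NumPy illustration, not the released OpenScene code: it assumes per-point features and class-label text embeddings (both in CLIP space) are already computed, and shows how cosine similarity yields both per-point labels and the heat-map scores used for open-vocabulary queries.

```python
import numpy as np

def zero_shot_segment(point_feats, text_embeds):
    """Assign each 3D point the class whose CLIP text embedding is most
    similar (cosine similarity) to the point's predicted feature.

    point_feats: (N, D) dense per-point features co-embedded in CLIP space
    text_embeds: (C, D) CLIP text embeddings of arbitrary class labels
    Returns per-point label indices (N,) and the similarity map (N, C).
    """
    # L2-normalize both sides so the dot product equals cosine similarity
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sims = p @ t.T                      # (N, C) similarity scores
    return sims.argmax(axis=1), sims    # labels + heat-map scores per query

# Toy usage with random stand-in features; in practice point_feats come from
# the trained 3D model and text_embeds from a CLIP text encoder.
rng = np.random.default_rng(0)
labels, sims = zero_shot_segment(rng.normal(size=(5, 8)),
                                 rng.normal(size=(3, 8)))
```

A column of `sims` for a single text query is exactly the per-point heat map mentioned above; adding a new class or query only requires encoding one more text string, with no retraining.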
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | ScanNet V2 (val) | mIoU | 54.2 | 316 |
| Semantic segmentation | nuScenes (val) | mIoU (Segmentation) | 0.421 | 265 |
| Semantic segmentation | ScanNet v2 (test) | mIoU | 54.2 | 248 |
| 3D Semantic Segmentation | ScanNet V2 (val) | mIoU | 62.8 | 209 |
| 3D Visual Grounding | ScanRefer (val) | Overall Accuracy @ IoU 0.5 | 6.5 | 192 |
| LiDAR Semantic Segmentation | nuScenes (val) | mIoU | 42.1 | 169 |
| 3D Semantic Segmentation | ScanNet (val) | mIoU | 47 | 144 |
| Instance Segmentation | ScanNet200 (val) | mAP@50 | 15.2 | 72 |
| 3D Instance Segmentation | ScanNet200 | mAP@0.5 | 6.2 | 63 |
| 3D Instance Segmentation | ScanNet200 (val) | mAP | 11.7 | 55 |