Masked Point-Entity Contrast for Open-Vocabulary 3D Scene Understanding
About
Open-vocabulary 3D scene understanding is pivotal for enhancing physical intelligence, as it enables embodied agents to interpret and interact dynamically within real-world environments. This paper introduces MPEC, a novel Masked Point-Entity Contrastive learning method for open-vocabulary 3D semantic segmentation that leverages both 3D entity-language alignment and point-entity consistency across different point cloud views to foster entity-specific feature representations. Our method improves semantic discrimination and enhances the differentiation of unique instances, achieving state-of-the-art results on ScanNet for open-vocabulary 3D semantic segmentation and demonstrating superior zero-shot scene understanding capabilities. Extensive fine-tuning experiments on 8 datasets, spanning from low-level perception to high-level reasoning tasks, showcase the potential of learned 3D features, driving consistent performance gains across varied 3D scene understanding tasks. Project website: https://mpec-3d.github.io/
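To make the idea of point-entity consistency across views more concrete, the following is a minimal sketch of a generic InfoNCE-style contrast between per-point features from one view and entity prototypes from another. It is not the paper's actual loss: the function names, the mean-pooled prototypes, the temperature value, and the omission of the masking step and the language-alignment term are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def entity_prototypes(feats, entity_ids):
    """Mean-pool normalized point features per entity id (an assumed pooling choice)."""
    sums, counts = {}, defaultdict(int)
    for feat, ent in zip(feats, entity_ids):
        feat = l2_normalize(feat)
        if ent not in sums:
            sums[ent] = [0.0] * len(feat)
        sums[ent] = [s + x for s, x in zip(sums[ent], feat)]
        counts[ent] += 1
    return {e: l2_normalize([s / counts[e] for s in sums[e]]) for e in sums}


def point_entity_contrast(feats_a, ids_a, feats_b, ids_b, temperature=0.1):
    """Hypothetical loss: pull each view-A point toward its own entity's
    prototype from view B and push it away from all other prototypes."""
    protos = entity_prototypes(feats_b, ids_b)
    keys = list(protos)
    losses = []
    for feat, ent in zip(feats_a, ids_a):
        if ent not in protos:  # entity not visible in view B
            continue
        feat = l2_normalize(feat)
        logits = [
            sum(x * y for x, y in zip(feat, protos[k])) / temperature
            for k in keys
        ]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_z - logits[keys.index(ent)])
    return sum(losses) / len(losses) if losses else 0.0


# Two toy views of the same scene with two entities (ids 0 and 1).
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[1.0, 0.0], [0.0, 1.0]]
aligned = point_entity_contrast(view_a, [0, 1], view_b, [0, 1])
swapped = point_entity_contrast(view_a, [0, 1], view_b, [1, 0])
```

In this toy setup the loss is close to zero when the entity assignments agree across views and large when they are swapped, which is the behavior a consistency objective of this kind is meant to reward.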
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Semantic segmentation | ScanNet V2 (val) | mIoU | 75.8 | 288 |
| 3D Visual Grounding | Nr3D (test) | Overall Success Rate | 66.7 | 88 |
| Instance Segmentation | ScanNetV2 (val) | mAP@0.5 | 62.5 | 58 |
| Semantic segmentation | ScanNet200 v1 (val) | mIoU | 31.8 | 19 |
| 3D Semantic Segmentation | ScanNet200 (test) | mIoU (f) | 10.8 | 15 |
| Open-Vocabulary 3D Semantic Segmentation | ScanNet 14 (val) | f-mAcc | 81.3 | 13 |
| Instance Segmentation | ScanNet200 v1 (val) | mAP@0.5 | 31.6 | 6 |
| 3D Semantic Segmentation | SceneVerse (val) | f-mIoU | 45 | 3 |
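Several rows above report mIoU. As a reminder of what that number measures, here is a minimal sketch of the standard mean intersection-over-union computed from per-point predicted and ground-truth labels. The function name, the `ignore_index` convention, and the choice to skip classes absent from both prediction and ground truth are assumptions; individual benchmarks may handle these details differently, and the foreground variants listed in the table (f-mIoU, f-mAcc, mIoU (f)) are not covered here.

```python
def mean_iou(pred, target, num_classes, ignore_index=-1):
    """Mean IoU over classes that appear in the prediction or the ground truth.

    pred and target are equal-length sequences of integer class ids in
    [0, num_classes); points whose target equals ignore_index are skipped.
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            union[p] += 1  # false positive for the predicted class
            union[t] += 1  # false negative for the true class
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Six points, three classes: per-class IoUs are 1/3, 2/3, and 1/2.
score = mean_iou([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3)
```

Benchmark tables usually report this value multiplied by 100, so the toy score of 0.5 would appear as 50.0.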