Masked Point-Entity Contrast for Open-Vocabulary 3D Scene Understanding

About

Open-vocabulary 3D scene understanding is pivotal for enhancing physical intelligence, as it enables embodied agents to interpret and interact dynamically within real-world environments. This paper introduces MPEC, a novel Masked Point-Entity Contrastive learning method for open-vocabulary 3D semantic segmentation that leverages both 3D entity-language alignment and point-entity consistency across different point cloud views to foster entity-specific feature representations. Our method improves semantic discrimination and enhances the differentiation of unique instances, achieving state-of-the-art results on ScanNet for open-vocabulary 3D semantic segmentation and demonstrating superior zero-shot scene understanding capabilities. Extensive fine-tuning experiments on 8 datasets, spanning from low-level perception to high-level reasoning tasks, showcase the potential of learned 3D features, driving consistent performance gains across varied 3D scene understanding tasks. Project website: https://mpec-3d.github.io/

Yan Wang, Baoxiong Jia, Ziyu Zhu, Siyuan Huang • 2025
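
The point-entity consistency described in the abstract can be illustrated with a generic InfoNCE-style loss: each point feature from one view is pulled toward the mean feature (prototype) of its own entity in another view and pushed away from the other entities' prototypes. This is only a minimal sketch under stated assumptions, not the MPEC loss itself. The function names, the prototype-by-mean choice, and the temperature value are hypothetical, and the masking and entity-language alignment parts of the method are omitted.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def entity_prototypes(feats, entity_ids):
    """Average the point features of each entity, then L2-normalize the mean."""
    sums, counts = {}, {}
    for f, e in zip(feats, entity_ids):
        if e not in sums:
            sums[e] = [0.0] * len(f)
            counts[e] = 0
        sums[e] = [s + x for s, x in zip(sums[e], f)]
        counts[e] += 1
    return {e: l2_normalize([s / counts[e] for s in sums[e]]) for e in sums}


def point_entity_contrast(feats_a, ids_a, feats_b, ids_b, temperature=0.1):
    """InfoNCE between view-A points and view-B entity prototypes.

    Assumption: the two views share entity ids. Points whose entity does not
    appear in view B are skipped.
    """
    protos = entity_prototypes(feats_b, ids_b)
    keys = list(protos)
    losses = []
    for f, e in zip(feats_a, ids_a):
        if e not in protos:
            continue
        q = l2_normalize(f)
        # Cosine similarity to every prototype, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(q, protos[k])) / temperature
                  for k in keys]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the point's own entity as the positive.
        losses.append(log_z - logits[keys.index(e)])
    return sum(losses) / len(losses) if losses else 0.0
```

With consistent features across views the loss is close to zero, and it grows when a point's feature matches a different entity's prototype, which is the behavior a consistency objective of this kind is meant to penalize.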

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Semantic segmentation	ScanNet V2 (val)	mIoU	75.8	380
3D Visual Grounding	Nr3D (test)	Overall Success Rate	66.7	88
Instance Segmentation	ScanNetV2 (val)	mAP@0.5	62.5	58
Semantic segmentation	ScanNet200 v1 (val)	mIoU	31.8	19
3D Semantic Segmentation	ScanNet200 (test)	mIoU (f)	10.8	15
Open-Vocabulary 3D Semantic Segmentation	ScanNet 14 (val)	f-mAcc	81.3	13
Instance Segmentation	ScanNet200 v1 (val)	mAP@0.5	31.6	6
3D Semantic Segmentation	SceneVerse (val)	f-mIoU	45	3
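
Several rows above report mIoU. For readers unfamiliar with the metric, here is a minimal sketch of how mean intersection-over-union is commonly computed from per-point labels: per class, IoU = TP / (TP + FP + FN), averaged over classes. The function name, the `ignore_label` convention, and the choice to skip classes that appear in neither prediction nor ground truth are assumptions for illustration. Individual benchmarks may handle these details differently.

```python
def mean_iou(pred, target, num_classes, ignore_label=-1):
    """Mean IoU over classes from two equal-length sequences of integer labels."""
    inter = [0] * num_classes   # true positives per class
    union = [0] * num_classes   # TP + FP + FN per class
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # unlabeled ground-truth points do not count
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            union[t] += 1               # false negative for the true class
            if 0 <= p < num_classes:
                union[p] += 1           # false positive for the predicted class
    # Assumption: classes absent from both pred and target are left out of the mean.
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")
```

For example, `mean_iou([0, 0, 1, 1, 2], [0, 1, 1, 1, 2], num_classes=3)` averages per-class IoUs of 1/2, 2/3, and 1. Benchmark tables such as the one above report the value as a percentage.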
