PLA: Language-Driven Open-Vocabulary 3D Scene Understanding

About

Open-vocabulary scene understanding aims to localize and recognize unseen categories beyond the annotated label space. The recent breakthrough in 2D open-vocabulary perception is largely driven by Internet-scale paired image-text data with rich vocabulary concepts. However, this success cannot be directly transferred to 3D scenarios because large-scale 3D-text pairs are not available. To this end, we propose to distill knowledge encoded in pre-trained vision-language (VL) foundation models by captioning multi-view images of 3D scenes, which allows 3D data to be explicitly associated with semantically rich captions. Further, to foster coarse-to-fine visual-semantic representation learning from captions, we design hierarchical 3D-caption pairs, leveraging geometric constraints between 3D scenes and multi-view images. Finally, by employing contrastive learning, the model learns language-aware embeddings that connect 3D and text for open-vocabulary tasks. Our method not only remarkably outperforms baseline methods by 25.8% $\sim$ 44.7% hIoU and 14.5% $\sim$ 50.4% hAP$_{50}$ in open-vocabulary semantic and instance segmentation, respectively, but also shows robust transferability on challenging zero-shot domain transfer tasks. See the project website at https://dingry.github.io/projects/PLA.
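The contrastive step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a generic InfoNCE-style loss in plain Python, assuming that each caption is associated with a set of 3D points, that point features are average-pooled over that set, and that the pooled feature is compared with caption embeddings by cosine similarity. The function names, the pooling choice, and the temperature value are all hypothetical.

```python
import math

def pool_points(point_feats, point_ids):
    """Average the features of the points associated with one caption (assumed pooling)."""
    dim = len(point_feats[0])
    pooled = [0.0] * dim
    for i in point_ids:
        for d in range(dim):
            pooled[d] += point_feats[i][d]
    return [v / len(point_ids) for v in pooled]

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def contrastive_loss(region_feats, text_feats, temperature=0.1):
    """InfoNCE-style loss: pooled 3D feature i should match caption i, not the others."""
    regions = [normalize(r) for r in region_feats]
    texts = [normalize(t) for t in text_feats]
    total = 0.0
    for i, r in enumerate(regions):
        logits = [sum(a * b for a, b in zip(r, t)) / temperature for t in texts]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(regions)
```

With this sketch, correctly paired 3D and caption features yield a lower loss than the same features with the captions swapped, which is the signal that pulls 3D embeddings toward the text embedding space.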

Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, Xiaojuan Qi • 2022

Related benchmarks

Task	Dataset	Result
3D Semantic Segmentation	ScanNet V2 (val)	mIoU 17.7	209
3D Semantic Segmentation	ScanNet B12 N7	hIoU 5.53e+3	20
3D Semantic Segmentation	ScanNet B10/N9	hIoU 59.2	20
3D Semantic Segmentation	S3DIS (B8/N4)	hIoU 3.46e+3	19
3D Semantic Segmentation	S3DIS B6 N6	hIoU 46.7	19
3D Semantic Segmentation	ScanNet200 (test)	mIoU (f) 1.8	15
3D Semantic Segmentation	ScanNet B15 N4	hIoU 70.3	13
3D Instance Segmentation	S3DIS (B8/N4)	mAP50 (Base) 60.3	13
3D Instance Segmentation	S3DIS B6 N6	mAP50 (Base) 49.2	13
Open-Vocabulary 3D Semantic Segmentation	ScanNet 14 (val)	f-mAcc 41.5	13

Showing 10 of 36 rows
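
For readers unfamiliar with the hIoU metric in the table: in the open-vocabulary setting it is the harmonic mean of the mIoU on base (annotated) categories and the mIoU on novel (unseen) categories, and a split name such as B15/N4 denotes 15 base and 4 novel categories. A minimal sketch, with a hypothetical function name and illustrative inputs that are not taken from the table:

```python
def harmonic_iou(miou_base, miou_novel):
    """Harmonic mean of base-class and novel-class mIoU (both in percent)."""
    if miou_base + miou_novel == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * miou_base * miou_novel / (miou_base + miou_novel)

# Illustrative values only: a model strong on base classes but weak on novel ones.
example = harmonic_iou(60.0, 30.0)
```

Because the harmonic mean is dominated by the smaller of the two scores, a model cannot reach a high hIoU by doing well on base categories alone.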
