Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding

About

Recent vision-language pre-training models have exhibited remarkable generalization ability in zero-shot recognition tasks. Previous open-vocabulary 3D scene understanding methods mostly train 3D models with either image or text supervision, neglecting the collective strength of all modalities. In this work, we propose a Dense Multimodal Alignment (DMA) framework that densely co-embeds different modalities into a common space to maximize their synergistic benefits. Instead of extracting coarse view- or region-level text prompts, we leverage large vision-language models to extract complete category information and scalable scene descriptions to build the text modality, and use the image modality as a bridge to build dense point-pixel-text associations. In addition, to enhance the generalization ability of the 2D model for downstream 3D tasks without compromising its open-vocabulary capability, we employ a dual-path integration approach that combines frozen CLIP visual features with learnable mask features. Extensive experiments show that our DMA method achieves highly competitive open-vocabulary segmentation performance on various indoor and outdoor tasks.
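To make the idea of point-pixel-text association concrete, here is a minimal sketch of one generic way to link a 3D point to a text label through an image. It is not the DMA implementation. It assumes a pinhole camera, points already expressed in the camera frame, and per-pixel features that live in the same embedding space as the text embeddings. All function names, parameters, and toy values are hypothetical.

```python
import math


def project_point(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point to integer pixel coordinates.

    Returns None when the point lies on or behind the image plane (z <= 0).
    """
    x, y, z = point
    if z <= 0:
        return None
    u = int(round(fx * x / z + cx))
    v = int(round(fy * y / z + cy))
    return u, v


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def label_points(points, pixel_features, text_embeddings, intrinsics):
    """Assign each point the text label most similar to the feature of its pixel.

    pixel_features: nested list indexed as [row v][column u] -> feature vector
                    (assumed to share an embedding space with the text).
    text_embeddings: dict mapping a category name to its embedding vector.
    intrinsics: (fx, fy, cx, cy) of the assumed pinhole camera.
    Points that project outside the image, or behind the camera, get None.
    """
    fx, fy, cx, cy = intrinsics
    height = len(pixel_features)
    width = len(pixel_features[0])
    labels = []
    for point in points:
        uv = project_point(point, fx, fy, cx, cy)
        if uv is None or not (0 <= uv[0] < width and 0 <= uv[1] < height):
            labels.append(None)
            continue
        u, v = uv
        feature = pixel_features[v][u]
        best = max(text_embeddings, key=lambda name: cosine(feature, text_embeddings[name]))
        labels.append(best)
    return labels


# Toy 2x2 "image" whose left column looks like "chair" and right column like "table".
toy_features = [[[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0]]]
toy_text = {"chair": [1.0, 0.0], "table": [0.0, 1.0]}
toy_points = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 0.0, -1.0), (5.0, 0.0, 1.0)]
toy_labels = label_points(toy_points, toy_features, toy_text, (1.0, 1.0, 0.0, 0.0))
```

In this toy setup the first two points land on the left and right pixel columns, while the last two are skipped because one is behind the camera and the other projects outside the image. A real pipeline would also need occlusion handling and multi-view fusion, which this sketch omits.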

Ruihuang Li, Zhengqiang Zhang, Chenhang He, Zhiyuan Ma, Vishal M. Patel, Lei Zhang • 2024

Related benchmarks

Task	Dataset	Metric	Result
3D Semantic Segmentation	ScanNet V2 (val)	mIoU	53.3	209
3D Semantic Segmentation	Matterport3D (test)	mIoU	45.1	32
3D Semantic Segmentation	Matterport3D (val)	mIoU	39.8	31
3D Semantic Segmentation	nuScenes 1.0 (val)	mIoU	45.1	19
3D Semantic Segmentation	Matterport3D K=40 (test)	mIoU	37.9	17
3D Semantic Segmentation	Matterport3D K=80 (test)	mIoU	19.7	17
3D Semantic Segmentation	Matterport3D K=160 (test)	mIoU	9.4	17
3D Semantic Segmentation	Matterport3D 1.0 (test)	mAcc	57.6	14
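For readers unfamiliar with the two metrics in the table, the sketch below shows the standard definitions of mean intersection-over-union (mIoU) and mean per-class accuracy (mAcc) over per-point labels. It is an illustration only. The exact evaluation protocol of each benchmark (ignored labels, class lists, averaging details) is not stated here, so the handling of absent classes below is an assumption, and the function names are hypothetical.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the predictions or the ground truth.

    pred, target: equal-length sequences of integer class ids in [0, num_classes).
    Classes with an empty union are skipped (an assumption of this sketch).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            # A mismatch enlarges the union of both the predicted and the true class.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


def mean_accuracy(pred, target, num_classes):
    """Mean per-class accuracy (recall) over classes present in the ground truth."""
    correct = [0] * num_classes
    total = [0] * num_classes
    for p, t in zip(pred, target):
        total[t] += 1
        if p == t:
            correct[t] += 1
    accs = [c / n for c, n in zip(correct, total) if n > 0]
    return sum(accs) / len(accs) if accs else 0.0


# Toy example with two classes and four points.
toy_pred = [0, 0, 1, 1]
toy_target = [0, 1, 1, 1]
toy_miou = mean_iou(toy_pred, toy_target, 2)       # (1/2 + 2/3) / 2
toy_macc = mean_accuracy(toy_pred, toy_target, 2)  # (1/1 + 2/3) / 2
```

Both functions return fractions in [0, 1], whereas the table values read as percentages (for example, 53.3 would correspond to 0.533).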

