Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation

About

We tackle open-vocabulary 3D scene understanding by introducing a novel data generation pipeline and training framework. Our method addresses three critical requirements for effective training: precise 3D region segmentation, comprehensive textual descriptions, and sufficient dataset scale. By leveraging state-of-the-art open-vocabulary image segmentation models and region-aware Vision-Language Models, we develop an automatic pipeline that generates high-quality 3D mask-text pairs. Applying this pipeline to multiple 3D scene datasets, we create Mosaic3D-5.6M, a dataset of over 30K annotated scenes with 5.6M mask-text pairs, significantly larger than existing datasets. Building upon this data, we propose Mosaic3D, a foundation model combining a 3D encoder trained with contrastive learning and a lightweight mask decoder for open-vocabulary 3D semantic and instance segmentation. Our approach achieves state-of-the-art results on open-vocabulary 3D semantic and instance segmentation tasks including ScanNet200, Matterport3D, and ScanNet++, with ablation studies validating the effectiveness of our large-scale training data.

Junha Lee, Chunghyun Park, Jaesung Choe, Yu-Chiang Frank Wang, Jan Kautz, Minsu Cho, Chris Choy • 2025
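
The abstract describes a 3D encoder trained with contrastive learning on 3D mask-text pairs. As a rough illustration of that idea only, the sketch below pools per-point features inside each mask and aligns the pooled feature with the embedding of its paired caption using a generic InfoNCE-style loss. It is not the authors' implementation: the function name, the mean pooling, the one-directional (region-to-text) loss, and the temperature of 0.07 are all assumptions made for this example.

```python
import numpy as np


def mask_text_contrastive_loss(point_feats, masks, text_embeds, temperature=0.07):
    """Hypothetical region-to-text contrastive loss (illustrative only).

    point_feats: (N, D) per-point features from a 3D encoder.
    masks:       (M, N) boolean array, one row per 3D region mask.
    text_embeds: (M, D) text embeddings, row i is the caption for mask i.
    """
    weights = masks.astype(point_feats.dtype)
    # Mean-pool the point features inside each mask (assumed pooling choice).
    counts = np.maximum(weights.sum(axis=1, keepdims=True), 1.0)
    region = weights @ point_feats / counts
    # L2-normalise both sides so the dot product is a cosine similarity.
    region = region / (np.linalg.norm(region, axis=1, keepdims=True) + 1e-8)
    text = text_embeds / (np.linalg.norm(text_embeds, axis=1, keepdims=True) + 1e-8)
    logits = region @ text.T / temperature  # (M, M) similarity matrix
    # Numerically stable log-softmax over captions for each region.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matching caption for region i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))


# Toy usage: six points, three masks of two points each, three captions.
point_feats = np.eye(4)[[0, 0, 1, 1, 2, 2]]
masks = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
], dtype=bool)
text_embeds = np.eye(4)[:3]
aligned_loss = mask_text_contrastive_loss(point_feats, masks, text_embeds)
shuffled_loss = mask_text_contrastive_loss(point_feats, masks, text_embeds[[1, 2, 0]])
```

In this toy setup the loss is near zero when each region feature matches its own caption and becomes large when the captions are shuffled, which is the behaviour a contrastive objective is meant to have.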

Related benchmarks

Task	Dataset	Result
3D Instance Segmentation	ScanNet200 (val)	mAP 11.8	85
3D Instance Segmentation	ScanNet200	mAP@0.5 16	63
3D Semantic Segmentation	ScanNet V2	mIoU 48.9	35
3D Semantic Segmentation	ScanNet200	mIoU 12.4	28
3D Semantic Segmentation	ScanNet200 (val)	mIoU (All Classes) 13.1	25
3D Semantic Segmentation	ScanNet200 (test)	mIoU (f) 15.7	15
3D Semantic Segmentation	ScanNet40 (val)	mIoU 35.7	11
3D Semantic Segmentation	Matterport3D 160 classes (test)	f-mIoU 13.1	8
3D Semantic Segmentation	ScanNet++ 100 classes (test)	f-mIoU 18	8
3D Semantic Segmentation	ScanNet 20 (val)	mIoU 50.3	7

