
SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation

About

Open-vocabulary 3D instance segmentation is a core capability for robotics and AR/VR, but prior methods trade one bottleneck for another: multi-stage 2D+3D pipelines aggregate foundation-model outputs at hundreds of seconds per scene, while pseudo-labeled end-to-end approaches rely on fragmented masks and external region proposals. We present SpaCeFormer, a proposal-free space-curve transformer that runs in 0.12 to 0.30 seconds per scene across standard benchmarks, 2 to 3 orders of magnitude faster than multi-stage 2D+3D pipelines. We pair it with SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset (3.0M multi-view-consistent captions over 604K instances from 7.4K scenes) built through multi-view mask clustering and multi-view VLM captioning; it reaches 21× higher mask recall than prior single-view pipelines (54.3% vs 2.5% at IoU > 0.5). SpaCeFormer combines spatial window attention with Morton-curve serialization for spatially coherent features, and uses a RoPE-enhanced decoder to predict instance masks directly from learned queries without external proposals. On ScanNet200 we achieve 11.1 zero-shot mAP, a 2.8× improvement over the prior best proposal-free method; on ScanNet++ and Replica, we reach 22.9 and 24.1 mAP, surpassing all prior methods including those using multi-view 2D inputs.
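
The abstract names Morton-curve serialization without spelling it out. The sketch below shows the generic Z-order construction only: points are quantized to an integer voxel grid, the bits of the three coordinates are interleaved into one code, and points are sorted by that code so that neighbors in the 1D sequence tend to be neighbors in 3D. The function names, the 10-bit depth, the x-lowest bit order, and the voxel size are illustrative assumptions, not details taken from SpaCeFormer.

```python
import numpy as np

def morton_codes(grid: np.ndarray, bits: int = 10) -> np.ndarray:
    """Interleave the low `bits` bits of integer (x, y, z) voxel coordinates.

    Assumes non-negative coordinates below 2**bits; x occupies the lowest
    bit of each 3-bit group (an arbitrary choice for this sketch).
    """
    grid = grid.astype(np.uint64)
    codes = np.zeros(len(grid), dtype=np.uint64)
    for b in range(bits):
        for axis in range(3):
            # Pick bit b of this axis and place it at position 3*b + axis.
            bit = (grid[:, axis] >> np.uint64(b)) & np.uint64(1)
            codes |= bit << np.uint64(3 * b + axis)
    return codes

def serialize(points: np.ndarray, voxel_size: float) -> np.ndarray:
    """Return the permutation that orders points along the Morton curve."""
    # Shift to the origin and quantize to voxel indices.
    grid = np.floor((points - points.min(axis=0)) / voxel_size).astype(np.int64)
    # Stable sort keeps the input order for points that share a voxel.
    return np.argsort(morton_codes(grid), kind="stable")

if __name__ == "__main__":
    pts = np.array([[0.9, 0.9, 0.9], [0.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
    print(serialize(pts, voxel_size=0.5))
```

Once points are in this order, fixed-size chunks of the sequence can serve as attention windows, which is one common way serialized point transformers obtain locality; how SpaCeFormer combines this with its spatial window attention is not specified here.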

Chris Choy, Junha Lee, Chunghyun Park, Minsu Cho, Jan Kautz • 2026

Related benchmarks

Task | Dataset | Result | Rank
3D Instance Segmentation | ScanNet200 (val) | mAP 16.7 | 78
Class-agnostic 3D instance segmentation | ScanNet200 (val) | AP 22.5 | 19
3D Instance Segmentation | Replica 8 scenes | mAP 24.1 | 16
3D Instance Segmentation | ScanNet++ 100 classes (val) | mAP 22.9 | 9
Class-agnostic instance segmentation | ScanNet++ 100 classes (test) | AP 29.8 | 7
Class-agnostic instance segmentation | Replica 8 scenes (test) | AP 33.2 | 1
