Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

About

Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto emerges spatial representations with superior fine-grained geometric and semantic consistency.
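The abstract describes the training objective as a combination of two terms: 3D intra-modal self-distillation and 2D-3D cross-modal joint embedding. The sketch below only illustrates how two such terms could be added together. It is not the authors' implementation: the abstract does not state the loss functions, so cosine distance is used as a stand-in for both terms, and every name here (`joint_loss`, `cross_weight`, the feature arguments) is hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_cosine_distance(preds, targets):
    """Average cosine distance over paired feature vectors."""
    return sum(cosine_distance(p, t) for p, t in zip(preds, targets)) / len(preds)


def joint_loss(student_3d, teacher_3d, projected_3d, image_2d, cross_weight=1.0):
    """Toy two-term objective (assumed form, not taken from the paper).

    student_3d / teacher_3d: per-point features from a student and a teacher
        3D encoder (intra-modal self-distillation term).
    projected_3d / image_2d: 3D features mapped into the image feature space,
        paired with 2D features at corresponding pixels (cross-modal term).
    cross_weight: hypothetical balancing factor between the two terms.
    """
    intra = mean_cosine_distance(student_3d, teacher_3d)
    cross = mean_cosine_distance(projected_3d, image_2d)
    return intra + cross_weight * cross


# Toy usage: the student matches the teacher, but the cross-modal pair is orthogonal.
loss = joint_loss(
    student_3d=[[1.0, 0.0]],
    teacher_3d=[[1.0, 0.0]],
    projected_3d=[[1.0, 0.0]],
    image_2d=[[0.0, 1.0]],
    cross_weight=0.5,
)
print(loss)
```

In this toy case the intra-modal term is zero and the cross-modal term is at its orthogonal value of 1, so the total equals the chosen weight.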

Yujia Zhang, Xiaoyang Wu, Yixing Lao, Chengyao Wang, Zhuotao Tian, Naiyan Wang, Hengshuang Zhao • 2025

Related benchmarks

Task                   | Dataset               | Metric              | Result | Rank
Semantic segmentation  | S3DIS (Area 5)        | mIoU                | 77.4   | 907
Semantic segmentation  | ScanNet V2 (val)      | mIoU                | 80.7   | 316
Semantic segmentation  | ScanNet (val)         | mIoU                | 80.7   | 274
Semantic segmentation  | nuScenes (val)        | mIoU (Segmentation) | 0.82   | 265
Object Classification  | ScanObjectNN OBJ_BG   | Accuracy            | 92.9   | 223
Semantic segmentation  | SemanticKITTI (val)   | mIoU                | 71.2   | 174
3D Visual Grounding    | ScanRefer             | Acc@0.5             | 52.6   | 142
Semantic segmentation  | ScanNet200 (val)      | mIoU                | 39.2   | 126
3D Dense Captioning    | Scan2Cap              | CIDEr@0.5           | 79.6   | 96
3D Question Answering  | SQA3D                 | EM                  | 60     | 69
Showing 10 of 29 rows
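
Most rows above report mIoU (mean intersection over union). As a reminder of what that number measures, here is a minimal sketch that computes it from per-point predicted and ground-truth labels. The function name `mean_iou` and the choice to skip classes that appear in neither prediction nor ground truth are assumptions for illustration. Individual benchmarks define their own rules for ignored labels and for accumulating counts over a whole dataset, so this is not any benchmark's official evaluation script.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes present in the prediction or the ground truth.

    pred, target: equal-length sequences of integer class labels.
    For class c: IoU_c = TP / (TP + FP + FN).
    """
    ious = []
    for c in range(num_classes):
        tp = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, target) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, target) if p != c and t == c)
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both pred and target (assumption)
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy usage with three classes and six points.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))  # per-class IoU: 1/3, 2/3, 1/2
```

The result is a fraction between 0 and 1. In the rows above, most entries are given as percentages, while the nuScenes entry (0.82) is given as a fraction.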
