Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
About
Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto emerges spatial representations with superior fine-grained geometric and semantic consistency.
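
The abstract describes the training objective only at a high level: 3D intra-modal self-distillation combined with a 2D-3D cross-modal joint embedding. As a rough illustration of how two such terms could be summed, here is a minimal sketch in plain Python. Everything specific in it is an assumption rather than the paper's formulation: the function names, the temperatures, the cosine-based cross-modal term, and the `weight` parameter are all hypothetical.

```python
import math

def log_softmax(logits, temperature):
    """Numerically stable log-softmax of a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def self_distillation_loss(student_logits, teacher_logits,
                           t_student=0.1, t_teacher=0.04):
    """Cross-entropy from the teacher distribution to the student distribution.

    Assumed form: the temperatures are illustrative defaults, not values
    taken from the paper.
    """
    log_p_student = log_softmax(student_logits, t_student)
    p_teacher = [math.exp(v) for v in log_softmax(teacher_logits, t_teacher)]
    return -sum(pt * lps for pt, lps in zip(p_teacher, log_p_student))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)

def cross_modal_loss(point_feats, pixel_feats):
    """Mean (1 - cosine similarity) over matched 3D-point / 2D-pixel features.

    Assumed form: the abstract does not state which distance is used.
    """
    sims = [cosine_similarity(p, q) for p, q in zip(point_feats, pixel_feats)]
    return 1.0 - sum(sims) / len(sims)

def joint_loss(student_logits, teacher_logits, point_feats, pixel_feats,
               weight=1.0):
    """Sum of the intra-modal and cross-modal terms (hypothetical weighting)."""
    intra = sum(
        self_distillation_loss(s, t)
        for s, t in zip(student_logits, teacher_logits)
    ) / len(student_logits)
    cross = cross_modal_loss(point_feats, pixel_feats)
    return intra + weight * cross
```

In this sketch, point features that already match their paired pixel features contribute almost nothing to the cross-modal term, so the total is dominated by the self-distillation term.
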
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | S3DIS (Area 5) | mIoU | 77.4 | 907 |
| Semantic segmentation | ScanNet V2 (val) | mIoU | 80.7 | 316 |
| Semantic segmentation | ScanNet (val) | mIoU | 80.7 | 274 |
| Semantic segmentation | nuScenes (val) | mIoU (Segmentation) | 0.82 | 265 |
| Object Classification | ScanObjectNN OBJ_BG | Accuracy | 92.9 | 223 |
| Semantic segmentation | SemanticKITTI (val) | mIoU | 71.2 | 174 |
| 3D Visual Grounding | ScanRefer | Acc@0.5 | 52.6 | 142 |
| Semantic segmentation | ScanNet200 (val) | mIoU | 39.2 | 126 |
| 3D Dense Captioning | Scan2Cap | CIDEr@0.5 | 79.6 | 96 |
| 3D Question Answering | SQA3D | EM | 60 | 69 |
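
Most rows above report mIoU (mean intersection over union). For readers unfamiliar with the metric, the following is a minimal sketch of the standard per-class definition over flat label lists. The function name `mean_iou` is hypothetical, and each benchmark's official evaluation script may differ in details such as ignored labels or how counts are accumulated across scenes.

```python
def mean_iou(pred, target, num_classes):
    """Average of per-class IoU = |pred ∩ target| / |pred ∪ target|.

    Classes that appear in neither prediction nor ground truth are skipped.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Five points, three classes: per-class IoUs are 1/2, 2/3, and 1.
pred = [0, 0, 1, 1, 2]
target = [0, 1, 1, 1, 2]
print(round(mean_iou(pred, target, num_classes=3), 4))  # 0.7222
```

The table reports most mIoU values as percentages, so a score of 0.7222 here would correspond to 72.2 on that scale.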