Is Contrastive Distillation Enough for Learning Comprehensive 3D Representations?
About
Cross-modal contrastive distillation has recently been explored for learning effective 3D representations. However, existing methods focus primarily on modality-shared features and neglect modality-specific features during pre-training, which leads to suboptimal representations. In this paper, we theoretically analyze the limitations of current contrastive methods for 3D representation learning and propose a new framework, namely CMCR (Cross-Modal Comprehensive Representation Learning), to address these shortcomings. Our approach improves upon traditional methods by better integrating both modality-shared and modality-specific features. Specifically, we introduce masked image modeling and occupancy estimation tasks to guide the network in learning more comprehensive modality-specific features. Furthermore, we propose a novel multi-modal unified codebook that learns an embedding space shared across different modalities. In addition, we introduce geometry-enhanced masked image modeling to further boost 3D representation learning. Extensive experiments demonstrate that our method mitigates the challenges faced by traditional approaches and consistently outperforms existing image-to-LiDAR contrastive distillation methods in downstream tasks. Code will be available at https://github.com/Eaphan/CMCR.
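For readers unfamiliar with the baseline the abstract critiques, image-to-LiDAR contrastive distillation is typically trained with an InfoNCE-style loss over paired point and pixel features. The following is a minimal pure-Python sketch under that assumption; it is not CMCR's implementation. The function names, the toy features, and the temperature of 0.07 are all illustrative choices, not taken from the paper.

```python
import math

def l2_normalize(v):
    """Scale a feature vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(point_feats, pixel_feats, temperature=0.07):
    """Mean InfoNCE loss, where point_feats[i] is paired with pixel_feats[i].

    Each point feature is pulled toward its paired pixel feature and
    pushed away from all other pixel features in the batch.
    """
    p = [l2_normalize(v) for v in point_feats]
    q = [l2_normalize(v) for v in pixel_feats]
    total = 0.0
    for i, pi in enumerate(p):
        # Cosine similarity of point i to every pixel feature, scaled by temperature.
        logits = [sum(a * b for a, b in zip(pi, qj)) / temperature for qj in q]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the paired pixel as the positive class.
        total += log_denom - logits[i]
    return total / len(p)

# Toy features (hypothetical): correctly paired vs. deliberately mismatched.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

A loss of this form rewards only the information that both modalities agree on, which is the modality-shared bias the abstract describes; the masked image modeling and occupancy estimation tasks are proposed as additional objectives to recover modality-specific features.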
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| 3D Object Detection | nuScenes (val) | mAP 52.7 | 128 |
| Semantic segmentation | nuScenes 1.0 (val) | mIoU 76.34 | 81 |
| Semantic segmentation | semanticKITTI SynLiDAR source (val) | mIoU 53.58 | 33 |
| Semantic segmentation | SemanticKITTI v1.0 (val) | mIoU 49.86 | 30 |
| LiDAR Semantic Segmentation | SemanticSTF (val) | mIoU 60.71 | 16 |
| Panoptic Segmentation | nuScenes 1% labels (val) | PQ 20.7 | 16 |
| Semantic segmentation | ScribbleKITTI (val) | mIoU 55.36 | 12 |
| Semantic segmentation | RELLIS-3D (val) | mIoU 56.4 | 12 |
| Semantic segmentation | SemanticPOSS (val) | mIoU 58.63 | 12 |
| Semantic segmentation | DAPS-3D (val) | mIoU 87.29 | 12 |