Is Contrastive Distillation Enough for Learning Comprehensive 3D Representations?

About

Cross-modal contrastive distillation has recently been explored for learning effective 3D representations. However, existing methods focus primarily on modality-shared features and neglect modality-specific features during pre-training, which leads to suboptimal representations. In this paper, we theoretically analyze the limitations of current contrastive methods for 3D representation learning and propose a new framework, CMCR (Cross-Modal Comprehensive Representation Learning), to address these shortcomings. Our approach improves upon traditional methods by better integrating both modality-shared and modality-specific features. Specifically, we introduce masked image modeling and occupancy estimation tasks to guide the network in learning more comprehensive modality-specific features. Furthermore, we propose a novel multi-modal unified codebook that learns an embedding space shared across different modalities. We also introduce geometry-enhanced masked image modeling to further boost 3D representation learning. Extensive experiments demonstrate that our method mitigates the challenges faced by traditional approaches and consistently outperforms existing image-to-LiDAR contrastive distillation methods in downstream tasks. Code will be available at https://github.com/Eaphan/CMCR.
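
For readers unfamiliar with the baseline the abstract refers to, image-to-LiDAR contrastive distillation is commonly formulated as an InfoNCE loss over paired point and pixel features. The sketch below illustrates that generic baseline only, not the CMCR method itself. The function names, the plain-list feature format, and the temperature of 0.07 are assumptions made for illustration and do not come from the paper or its repository.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce(point_feats, pixel_feats, temperature=0.07):
    """Mean InfoNCE loss over paired features (hypothetical helper).

    point_feats[i] and pixel_feats[i] are assumed to describe the same
    physical location (a positive pair). Every other pixel feature in the
    batch acts as a negative for point i.
    """
    if len(point_feats) != len(pixel_feats) or not point_feats:
        raise ValueError("need equally sized, non-empty feature lists")
    points = [l2_normalize(p) for p in point_feats]
    pixels = [l2_normalize(q) for q in pixel_feats]
    total = 0.0
    for i, p in enumerate(points):
        # Cosine similarity of point i with every pixel, scaled by temperature.
        logits = [sum(a * b for a, b in zip(p, q)) / temperature for q in pixels]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability of the matching pixel.
        total += log_denominator - logits[i]
    return total / len(points)


# Correctly paired features should give a lower loss than mismatched ones.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

A loss of this form only rewards agreement between the two modalities, which is one way to read the abstract's point that such objectives emphasize modality-shared features.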

Yifan Zhang, Junhui Hou • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Object Detection | nuScenes (val) | mAP | 52.7 | 128 |
| Semantic segmentation | nuScenes 1.0 (val) | mIoU | 76.34 | 81 |
| Semantic segmentation | SemanticKITTI SynLiDAR source (val) | mIoU | 53.58 | 33 |
| Semantic segmentation | SemanticKITTI v1.0 (val) | mIoU | 49.86 | 30 |
| LiDAR Semantic Segmentation | SemanticSTF (val) | mIoU | 60.71 | 16 |
| Panoptic Segmentation | nuScenes 1% labels (val) | PQ | 20.7 | 16 |
| Semantic segmentation | ScribbleKITTI (val) | mIoU | 55.36 | 12 |
| Semantic segmentation | RELLIS-3D (val) | mIoU | 56.4 | 12 |
| Semantic segmentation | SemanticPOSS (val) | mIoU | 58.63 | 12 |
| Semantic segmentation | DAPS-3D (val) | mIoU | 87.29 | 12 |
Showing 10 of 15 rows
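
Most rows above report mIoU, the per-class intersection-over-union averaged across classes. The following is a minimal sketch of that metric, assuming integer class labels per point. The function name and the `ignore_index` parameter are hypothetical, and each benchmark's exact evaluation protocol (for example, which classes are ignored) is not described on this page.

```python
def mean_iou(predictions, labels, num_classes, ignore_index=None):
    """Mean IoU over the classes that occur in predictions or labels."""
    intersection = [0] * num_classes  # true positives per class
    union = [0] * num_classes         # TP + FP + FN per class
    for pred, true in zip(predictions, labels):
        if true == ignore_index:
            continue  # skip unlabeled points
        if pred == true:
            intersection[true] += 1
            union[true] += 1
        else:
            union[true] += 1  # false negative for the true class
            union[pred] += 1  # false positive for the predicted class
    # Classes that never appear are left out of the average.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    if not ious:
        raise ValueError("no labeled points to evaluate")
    return sum(ious) / len(ious)


# Per-class IoU here is 1/2, 2/3 and 1, so the mean is 13/18.
score = mean_iou([0, 0, 1, 1, 2], [0, 1, 1, 1, 2], num_classes=3)
print(round(100 * score, 2))
```

The table reports the metric as a percentage, hence the multiplication by 100 in the usage line.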
