Decoupling Common and Unique Representations for Multimodal Self-supervised Learning
About
The increasing availability of multi-sensor data has sparked wide interest in multimodal self-supervised learning. However, most existing approaches learn only common representations across modalities while ignoring intra-modal training and modality-unique representations. We propose Decoupling Common and Unique Representations (DeCUR), a simple yet effective method for multimodal self-supervised learning. By distinguishing inter- and intra-modal embeddings through multimodal redundancy reduction, DeCUR can integrate complementary information across different modalities. We evaluate DeCUR in three common multimodal scenarios (radar-optical, RGB-elevation, and RGB-depth), and demonstrate consistent improvements across architectures in both multimodal and modality-missing settings. With thorough experiments and comprehensive analysis, we hope this work can provide valuable insights and raise more interest in researching the hidden relationships of multimodal representations.
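
The idea of separating inter- and intra-modal embeddings through redundancy reduction can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes a Barlow Twins-style cross-correlation objective in which the first `n_common` embedding dimensions are treated as common (cross-modal correlation pushed toward 1) and the remaining dimensions as unique (cross-modal correlation pushed toward 0), while intra-modal terms compare two augmented views of the same modality over all dimensions. The function names, the `n_common` split, and the weight `lam` are hypothetical choices made for illustration.

```python
import numpy as np

def standardize(z, eps=1e-8):
    """Normalize each embedding dimension to zero mean, unit variance over the batch."""
    return (z - z.mean(axis=0)) / (z.std(axis=0) + eps)

def cross_correlation(za, zb):
    """Cross-correlation matrix (D x D) between two batches of embeddings (N x D)."""
    n = za.shape[0]
    return standardize(za).T @ standardize(zb) / n

def redundancy_reduction(c, target_diag, lam):
    """Push the diagonal of c toward target_diag and the off-diagonal toward 0."""
    diag = np.diagonal(c)
    off = c - np.diag(diag)  # zero out the diagonal, keep the rest
    return np.sum((diag - target_diag) ** 2) + lam * np.sum(off ** 2)

def decoupled_loss(z1a, z1b, z2a, z2b, n_common, lam=0.005):
    """Illustrative loss. z1a/z1b: two views of modality 1; z2a/z2b: modality 2.

    Assumption: dimensions [0, n_common) are common, the rest are unique.
    """
    # Inter-modal: correlate modality 1 with modality 2.
    c_cross = cross_correlation(z1a, z2a)
    common = c_cross[:n_common, :n_common]
    unique = c_cross[n_common:, n_common:]
    loss_common = redundancy_reduction(common, 1.0, lam)  # align across modalities
    loss_unique = redundancy_reduction(unique, 0.0, lam)  # decorrelate across modalities

    # Intra-modal: two views of the same modality, all dimensions aligned.
    loss_m1 = redundancy_reduction(cross_correlation(z1a, z1b), 1.0, lam)
    loss_m2 = redundancy_reduction(cross_correlation(z2a, z2b), 1.0, lam)

    return loss_common + loss_unique + loss_m1 + loss_m2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    views = [rng.normal(size=(64, 8)) for _ in range(4)]
    print(decoupled_loss(*views, n_common=6))
```

Under these assumptions, the unique dimensions are shaped only by the intra-modal terms, so they can retain information that the other modality does not carry, while the common dimensions are shaped by both.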
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Segmentation | m-SA crop-type | Mean mIoU | 34.49 | 27 |
| Segmentation | m-chesapeake | Mean mIoU | 69.83 | 23 |
| Classification | m-so2sat GEO-Bench | Overall Accuracy | 61.7 | 22 |
| Classification | m-eurosat GEO-Bench | Overall Accuracy | 97.9 | 20 |
| Classification | m-brick-kiln GEO-Bench | Overall Accuracy (OA) | 98.7 | 20 |
| Field Boundary Segmentation | FTW (test) | Pixel IoU | 49 | 19 |
| Classification | m-so2sat (test) | Mean Accuracy | 56.68 | 17 |
| Flood Inundation Mapping | Sen1Flood11 | mIoU | 86.87 | 15 |
| Multi-Label Classification | m-bigearthnet GeoBench | F1 Score | 70.9 | 14 |
| Segmentation | m-cashew GeoBench | mIoU | 84.15 | 14 |