Decoupling Common and Unique Representations for Multimodal Self-supervised Learning
About
The increasing availability of multi-sensor data sparks wide interest in multimodal self-supervised learning. However, most existing approaches learn only common representations across modalities while ignoring intra-modal training and modality-unique representations. We propose Decoupling Common and Unique Representations (DeCUR), a simple yet effective method for multimodal self-supervised learning. By distinguishing inter- and intra-modal embeddings through multimodal redundancy reduction, DeCUR can integrate complementary information across different modalities. We evaluate DeCUR in three common multimodal scenarios (radar-optical, RGB-elevation, and RGB-depth), and demonstrate its consistent improvement regardless of architecture, in both full-multimodal and modality-missing settings. With thorough experiments and comprehensive analysis, we hope this work provides valuable insights and sparks further interest in exploring the hidden relationships among multimodal representations.
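To make the redundancy-reduction idea concrete, below is a minimal sketch of a DeCUR-style cross-modal loss in PyTorch. It assumes a Barlow Twins-style cross-correlation objective in which the embedding dimensions are split into a common block and a unique block; the split size, the `lambd` weight, the `1e-6` epsilon, and all function names here are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of a decoupled redundancy-reduction loss (DeCUR-style).
# All names and hyperparameters are illustrative, not the paper's code.
import torch


def cross_correlation(z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
    """Cross-correlation matrix of two batches of embeddings, shape (N, D)."""
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)  # batch-normalize per dim
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
    return (z_a.T @ z_b) / z_a.shape[0]              # (D, D) correlation


def redundancy_reduction(c: torch.Tensor, target_diag: float,
                         lambd: float = 5e-3) -> torch.Tensor:
    """Push the diagonal toward target_diag and off-diagonals toward zero."""
    diag = torch.diagonal(c)
    on_diag = (diag - target_diag).pow(2).sum()
    off_diag = (c - torch.diag_embed(diag)).pow(2).sum()
    return on_diag + lambd * off_diag


def decur_cross_modal_loss(z1: torch.Tensor, z2: torch.Tensor,
                           d_common: int) -> torch.Tensor:
    """Cross-modal term: align common dims, decorrelate unique dims.

    z1, z2: projector outputs of the two modalities, shape (N, D).
    The first d_common dimensions are treated as common, the rest as unique.
    """
    c = cross_correlation(z1, z2)
    c_common = c[:d_common, :d_common]   # common-to-common block
    c_unique = c[d_common:, d_common:]   # unique-to-unique block
    # Common dims should correlate across modalities (diagonal -> 1);
    # unique dims should not (diagonal -> 0).
    return (redundancy_reduction(c_common, target_diag=1.0)
            + redundancy_reduction(c_unique, target_diag=0.0))


# Hypothetical usage: 512-d projections, first 448 dims treated as common.
# z1, z2 = projector1(backbone1(x_radar)), projector2(backbone2(x_optical))
# loss = decur_cross_modal_loss(z1, z2, d_common=448)
```

In the full method this cross-modal term is combined with intra-modal training, i.e. a standard redundancy-reduction loss between two augmented views within each modality, which is what lets DeCUR keep modality-unique information instead of learning only the common representation.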
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Segmentation | m-chesapeake | Mean mIoU | 69.83 | 23 |
| Field Boundary Segmentation | FTW (test) | Pixel IoU | 49 | 19 |
| Flood Inundation Mapping | Sen1Flood11 | mIoU | 86.87 | 15 |
| Image Classification | m-forestnet (test) | Mean Accuracy | 55.9 | 13 |
| Segmentation | m-nz-cattle | Mean IoU | 83.04 | 13 |
| Segmentation | m-cashew-plant | Mean IoU | 84.15 | 13 |
| Segmentation | m-NeonTree | Mean mIoU | 57.47 | 13 |
| Classification | m-so2sat (test) | Mean Accuracy | 56.68 | 13 |
| Segmentation | m-SA crop-type | Mean mIoU | 34.49 | 13 |
| Classification | m-pv4ger (test) | Mean Accuracy | 97.38 | 13 |