CROMA: Remote Sensing Representations with Contrastive Radar-Optical Masked Autoencoders
About
Remote sensing is a vital and rapidly growing application that offers vast yet sparsely labeled, spatially aligned multimodal data, which makes self-supervised learning algorithms invaluable. We present CROMA: a framework that combines contrastive and reconstruction self-supervised objectives to learn rich unimodal and multimodal representations. Our method separately encodes masked-out multispectral optical and synthetic aperture radar samples, aligned in space and time, and performs cross-modal contrastive learning. Another encoder fuses these sensors, producing joint multimodal encodings that are used to predict the masked patches via a lightweight decoder. We show that these objectives are complementary when leveraged on spatially aligned multimodal data. We also introduce X- and 2D-ALiBi, which spatially bias our cross- and self-attention matrices. These strategies improve representations and allow our models to effectively extrapolate to images up to 17.6x larger at test-time. CROMA outperforms the current SoTA multispectral model on four classification benchmarks under finetuning (avg. 1.8%), linear probing (avg. 2.4%), nonlinear probing (avg. 1.4%), kNN classification (avg. 3.5%), and K-means clustering (avg. 8.4%), and on three segmentation benchmarks (avg. 6.4%). CROMA's rich, optionally multimodal representations can be widely leveraged across remote sensing applications.
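The spatial attention bias mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes that a 2D-ALiBi-style bias penalizes each query-key pair in proportion to the Euclidean distance between their patches on the image grid, and that each head uses a slope from the geometric sequence of the original 1D ALiBi. The function names `alibi_slopes` and `alibi_2d_bias` are hypothetical.

```python
import math


def alibi_slopes(num_heads):
    # Assumption: per-head slopes follow the geometric sequence used by
    # 1D ALiBi, i.e. 2^(-8/n), 2^(-16/n), ..., 2^(-8) for n heads.
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_2d_bias(grid_size, num_heads):
    """Return bias[h][q][k] = -slope_h * distance(patch q, patch k).

    Patches are indexed in row-major order over a grid_size x grid_size grid.
    The bias would be added to the attention logits before the softmax.
    """
    # (row, column) coordinate of every patch on the grid.
    coords = [(r, c) for r in range(grid_size) for c in range(grid_size)]
    return [
        [[-slope * math.dist(q, k) for k in coords] for q in coords]
        for slope in alibi_slopes(num_heads)
    ]


# Example: a 3x3 grid of patches with 2 attention heads.
bias = alibi_2d_bias(grid_size=3, num_heads=2)
```

Because the bias in this sketch depends only on relative patch distance and has no learned position table, the same function can be called with a larger `grid_size` at test time. That property is plausibly related to the extrapolation to larger images described in the abstract.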
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet-1K | Top-1 Acc | 80 | 836 |
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy | 80 | 512 |
| Action Recognition | UCF101 | Accuracy | 41.6 | 365 |
| Image Classification | ImageNet 1k (test) | Top-1 Accuracy | 80 | 359 |
| Object Detection | COCO | mAP | 33.9 | 107 |
| Change Detection | LEVIR | F1 Score | 88.5 | 62 |
| 3D Object Classification | ModelNet40 | -- | -- | 62 |
| Semantic segmentation | ScanNet | mIoU | 70.6 | 59 |
| Change Detection | OSCD | -- | -- | 26 |
| Semantic segmentation | SN-7-TS (test) | mIoU | 59.28 | 24 |