Cross-Scale MAE: A Tale of Multi-Scale Exploitation in Remote Sensing
About
Remote sensing images present unique challenges to image analysis due to their extensive geographic coverage, hardware limitations, and misaligned multi-scale imagery. This paper revisits the classical multi-scale representation learning problem, but under the general framework of self-supervised learning for remote sensing image understanding. We present Cross-Scale MAE, a self-supervised model built upon the Masked Auto-Encoder (MAE). During pre-training, Cross-Scale MAE employs scale augmentation and enforces cross-scale consistency through both contrastive and generative losses, ensuring consistent and meaningful representations well-suited for a wide range of downstream tasks. Further, our implementation leverages the xFormers library to accelerate network pre-training on a single GPU while maintaining the quality of learned representations. Experimental evaluations demonstrate that Cross-Scale MAE outperforms standard MAE and other state-of-the-art remote sensing MAE methods.
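The abstract describes a pre-training objective that combines a contrastive consistency term between embeddings of two scale-augmented views with a generative (masked-reconstruction) term, as in MAE. The following is a minimal NumPy sketch of what such a combined loss could look like; the function names (`info_nce`, `masked_mse`, `cross_scale_loss`) and the weighting scheme are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def info_nce(z_lo, z_hi, temperature=0.07):
    """Symmetric-in-spirit InfoNCE between matched embeddings of two
    scale views. z_lo, z_hi: (N, D) arrays; row i of each comes from
    the same image at a different scale (the positive pair)."""
    z_lo = z_lo / np.linalg.norm(z_lo, axis=1, keepdims=True)
    z_hi = z_hi / np.linalg.norm(z_hi, axis=1, keepdims=True)
    logits = z_lo @ z_hi.T / temperature            # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))              # positives on diagonal

def masked_mse(pred, target, mask):
    """Generative loss: MSE over masked patches only, as in standard MAE.
    pred, target: (N, P, D) patch pixels; mask: (N, P) with 1 = masked."""
    per_patch_err = ((pred - target) ** 2).mean(axis=-1)
    return (per_patch_err * mask).sum() / mask.sum()

def cross_scale_loss(z_lo, z_hi, pred, target, mask, lam=1.0):
    """Hypothetical total objective: cross-scale contrastive consistency
    plus masked reconstruction, weighted by lam (an assumed hyperparameter)."""
    return info_nce(z_lo, z_hi) + lam * masked_mse(pred, target, mask)
```

In this sketch, perfect reconstruction drives the generative term to zero, while the contrastive term pulls embeddings of the two scales of the same scene together and pushes different scenes apart.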
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | EuroSAT | Accuracy | 84.01 | 569 |
| Semantic Segmentation | Vaihingen | mIoU | 76.03 | 140 |
| Semantic Segmentation | Potsdam | mIoU | 76.17 | 81 |
| Image Classification | WHU-RS19 | Accuracy | 79.8 | 60 |
| Image Classification | fMoW (val) | Accuracy | 71.4 | 34 |
| Image Classification | UC Merced | Accuracy (KNN) | 93.1 | 31 |
| Image Classification | RESISC-45 (val) | Top-1 Accuracy | 91.1 | 22 |
| Image Classification | FireRisk (val) | Accuracy | 61.6 | 20 |
| Image Classification | ForestNet (val) | Accuracy | 49.7 | 20 |
| Semantic Segmentation | PASTIS-HD (val) | mIoU | 31.4 | 20 |