SARVLM: A Vision Language Foundation Model for Semantic Understanding in SAR Imagery
About
Synthetic Aperture Radar (SAR) is a critical imaging modality because it operates in all weather conditions. Although recent advances in self-supervised learning and masked image modeling (MIM) have enabled SAR foundation models, these approaches focus primarily on low-level visual features and often neglect multimodal representation. Multimodal SAR data is also scarce, which limits the development of robust cross-modal models. To address this limitation, we construct SARVLM-1M, a large-scale vision-language dataset comprising over one million image-text pairs aggregated from existing datasets. To mitigate the substantial differences between SAR and natural imagery, we propose a two-stage domain transfer training strategy that uses optical remote sensing data as an intermediate bridge, facilitating knowledge transfer from natural images to the SAR domain. Based on this strategy, we develop SARVLM, the first vision-language foundation model tailored for SAR, consisting of SARCLIP and SARCap. An ensemble strategy further improves the model's cross-scene generalization, and SARDet and SARRot validate the framework's capability in object detection. Extensive experiments on 13 benchmarks across image-text retrieval, target recognition, zero-shot classification, object detection, semantic localization, and image captioning demonstrate the superior feature extraction and interpretation capabilities of SARVLM. It consistently outperforms state-of-the-art vision-language models and advances semantic understanding in SAR imagery. Code and datasets will be released at https://github.com/KlayMa527/SARVLM.git.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| SAR Object Detection | SSDD | mAP50 | 92.1 | 44 |
| Bidirectional Retrieval | SARVLM-1M (test) | Mean Recall | 30.94 | 25 |
| Image-to-Text Retrieval | SARVLM-1M (test) | R@1 | 12.66 | 25 |
| Text-to-Image Retrieval | SARVLM-1M (test) | R@1 | 13.58 | 25 |
| Object Detection | SARDet-100K | mAP | 60.2 | 23 |
| Aircraft Detection | SAR-Aircraft | mAP50 | 86.2 | 11 |
| Target Recognition | MSTAR SOC | Accuracy | 86.55 | 7 |
| Target Recognition | SAR-VSA | Accuracy | 87.57 | 7 |
| Zero-shot Classification | MSTAR SOC | Accuracy (Zero-shot) | 60.45 | 7 |
| Zero-shot Classification | SAR-VSA | Accuracy (Zero-shot) | 40.79 | 7 |
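
For readers unfamiliar with the retrieval metrics in the table, the following is a minimal sketch of how R@K and a mean recall score are commonly computed from an image-text similarity matrix. It is not taken from the SARVLM code. The function names are hypothetical, each image is assumed to have exactly one matching caption at the same index, and the mean recall definition (the average of R@1, R@5, and R@10 over both retrieval directions) is a common convention that this page does not confirm.

```python
from typing import List, Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Percentage of queries whose matching item ranks within the top k.

    Assumes row i holds the scores of query i against every candidate,
    and that candidate i is the single correct match for query i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate indices by descending score (stable for ties).
        ranked = sorted(range(len(row)), key=lambda j: -row[j])
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


def transpose(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Swap queries and candidates to score the opposite retrieval direction."""
    return [list(col) for col in zip(*matrix)]


def mean_recall(similarity: Sequence[Sequence[float]],
                ks: Sequence[int] = (1, 5, 10)) -> float:
    """Average R@K over both directions (assumed definition, see lead-in)."""
    image_to_text = [recall_at_k(similarity, k) for k in ks]
    text_to_image = [recall_at_k(transpose(similarity), k) for k in ks]
    return sum(image_to_text + text_to_image) / (2 * len(ks))
```

With rows as images and columns as captions, `recall_at_k(similarity, 1)` corresponds to image-to-text R@1, and the same call on `transpose(similarity)` gives text-to-image R@1.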