SARVLM: A Vision Language Foundation Model for Semantic Understanding in SAR Imagery

About

Synthetic Aperture Radar (SAR) is a critical imaging modality due to its all-weather operational capability. Although recent advances in self-supervised learning and masked image modeling (MIM) have enabled SAR foundation models, these approaches primarily focus on low-level visual features and often neglect multimodal representation. Moreover, multimodal data for SAR is scarce, limiting the development of robust cross-modal models. To address this limitation, we construct SARVLM-1M, a large-scale vision-language dataset comprising over one million image-text pairs aggregated from existing datasets. Furthermore, to mitigate the substantial differences between SAR and natural imagery, we propose a two-stage domain transfer training strategy that uses optical remote sensing data as an intermediate bridge, facilitating effective knowledge transfer from natural images to the SAR domain. Based on this strategy, we develop SARVLM, the first vision-language foundation model tailored for SAR, consisting of SARCLIP and SARCap. In addition, an ensemble strategy is used to improve the cross-scene generalization capability of the model, and SARDet and SARRot further validate the capability of the proposed framework in object detection. Extensive experiments on 13 benchmarks across image-text retrieval, target recognition, zero-shot classification, object detection, semantic localization, and image captioning demonstrate the superior feature extraction and interpretation capabilities of SARVLM. It consistently outperforms state-of-the-art vision-language models and advances semantic understanding in SAR imagery. Code and datasets will be released at https://github.com/KlayMa527/SARVLM.git.

Qiwei Ma, Xukun Lu, Wang Liu, Puhong Duan, Xudong Kang, Shutao Li • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
SAR Object Detection | SSDD | mAP50 | 92.1 | 44
Bidirectional Retrieval | SARVLM-1M (test) | Mean Recall | 30.94 | 25
Image-to-Text Retrieval | SARVLM-1M (test) | R@1 | 12.66 | 25
Text-to-Image Retrieval | SARVLM-1M (test) | R@1 | 13.58 | 25
Object Detection | SARDet-100K | mAP | 60.2 | 23
Aircraft Detection | SAR-Aircraft | mAP50 | 86.2 | 11
Target Recognition | MSTAR SOC | Accuracy | 86.55 | 7
Target Recognition | SAR-VSA | Accuracy | 87.57 | 7
Zero-shot Classification | MSTAR SOC | Accuracy (Zero-shot) | 60.45 | 7
Zero-shot Classification | SAR-VSA | Accuracy (Zero-shot) | 40.79 | 7
Showing 10 of 15 rows
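
For readers unfamiliar with the retrieval metrics listed above, the following is a minimal sketch of how R@K and Mean Recall are commonly computed from an image-text similarity matrix. It is not taken from the SARVLM code. The function names, the toy matrix, the one-to-one pairing of image i with caption i, and the definition of Mean Recall as the average of R@1, R@5, and R@10 over both retrieval directions are all assumptions made for illustration; the paper's exact protocol may differ.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose ground-truth match ranks in the top k.

    sim[i][j] is the similarity between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank candidate indices by descending similarity, keep the top k.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return 100.0 * hits / len(sim)


def mean_recall(sim, ks=(1, 5, 10)):
    """Average R@K over both retrieval directions (assumed definition)."""
    # Transposing swaps the roles of queries and candidates.
    sim_t = [list(col) for col in zip(*sim)]
    scores = [recall_at_k(s, k) for s in (sim, sim_t) for k in ks]
    return sum(scores) / len(scores)


# Hypothetical 3x3 similarity matrix (rows: images, columns: captions).
toy_sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]
i2t_r1 = recall_at_k(toy_sim, 1)  # image-to-text R@1
t2i_r1 = recall_at_k([list(c) for c in zip(*toy_sim)], 1)  # text-to-image R@1
```

In this toy case two of the three images retrieve their own caption first, while only one of the three captions retrieves its own image first, which is why the two directions are reported separately in the table.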
