SARVLM: A Vision Language Foundation Model for Semantic Understanding in SAR Imagery

About

Synthetic Aperture Radar (SAR) is a critical imaging modality because of its all-weather operational capability. Although recent advances in self-supervised learning and masked image modeling (MIM) have enabled SAR foundation models, these approaches focus primarily on low-level visual features and often neglect multimodal representation. Multimodal data for SAR is also scarce, which limits the development of robust cross-modal models. To address this limitation, we construct SARVLM-1M, a large-scale vision-language dataset comprising over one million image-text pairs aggregated from existing datasets. To mitigate the substantial differences between SAR and natural imagery, we propose a two-stage domain transfer training strategy that uses optical remote sensing data as an intermediate bridge, facilitating knowledge transfer from natural images to the SAR domain. Based on this strategy, we develop SARVLM, the first vision-language foundation model tailored for SAR, consisting of SARCLIP and SARCap. An ensemble strategy is additionally used to improve the model's cross-scene generalization. SARDet and SARRot further validate the capability of the proposed framework in object detection. Extensive experiments on 13 benchmarks across image-text retrieval, target recognition, zero-shot classification, object detection, semantic localization, and image captioning demonstrate the superior feature extraction and interpretation capabilities of SARVLM. It consistently outperforms state-of-the-art vision-language models and advances semantic understanding in SAR imagery. Code and datasets will be released at https://github.com/KlayMa527/SARVLM.git.

Qiwei Ma, Xukun Lu, Wang Liu, Puhong Duan, Xudong Kang, Shutao Li • 2025

Related benchmarks

Task	Dataset	Result
SAR Object Detection	SSDD	mAP50 92.1	44
Bidirectional Retrieval	SARVLM-1M (test)	Mean Recall 30.94	25
Image-to-Text Retrieval	SARVLM-1M (test)	R@1 12.66	25
Text-to-Image Retrieval	SARVLM-1M (test)	R@1 13.58	25
Object Detection	SARDet-100K	mAP 60.2	23
Classification	PatternNet	Top-1 Accuracy 12.98	22
Aircraft detection	SAR-Aircraft	mAP50 86.2	11
Target Recognition	MSTAR SOC	Accuracy 86.55	7
Target Recognition	SAR-VSA	Accuracy 87.57	7
Zero-shot Classification	MSTAR SOC	Accuracy (Zero-shot) 60.45	7
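
For readers unfamiliar with the retrieval metrics in the table, the following is a minimal sketch of how R@K and a mean recall figure are commonly computed from an image-text similarity matrix. It is not taken from the SARVLM code. It assumes a one-to-one pairing (image i matches text i), and it assumes mean recall is the average of R@1, R@5, and R@10 over both retrieval directions, which is a common convention but is not confirmed by this page. The names `recall_at_k`, `mean_recall`, and `toy_sim` are hypothetical.

```python
def recall_at_k(sim, k):
    """Fraction of queries whose true match ranks in the top k.

    sim[i][j] is the similarity between query i and candidate j;
    the true match for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank of the true match = number of candidates scoring strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(sim)


def transpose(sim):
    """Swap query and candidate roles (image-to-text <-> text-to-image)."""
    return [list(col) for col in zip(*sim)]


def mean_recall(sim, ks=(1, 5, 10)):
    """Average of R@K over both directions (assumed convention)."""
    image_to_text = [recall_at_k(sim, k) for k in ks]
    text_to_image = [recall_at_k(transpose(sim), k) for k in ks]
    return sum(image_to_text + text_to_image) / (2 * len(ks))


# Toy 3x3 example: rows are images, columns are texts.
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.4, 0.7],
]
print(recall_at_k(toy_sim, 1))             # image-to-text R@1 as a fraction
print(recall_at_k(transpose(toy_sim), 1))  # text-to-image R@1 as a fraction
print(mean_recall(toy_sim, ks=(1, 2)))
```

The functions return fractions in [0, 1]; multiply by 100 to compare with percentage-style figures such as those in the table.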

