
VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing

About

Satellite imagery differs fundamentally from natural images: its aerial viewpoint, very high resolution, diverse scale variations, and abundance of small objects demand both region-level spatial reasoning and holistic scene understanding. Current remote-sensing approaches remain fragmented between dual-encoder retrieval models, which excel at large-scale cross-modal search but cannot interleave modalities, and generative assistants, which support region-level interpretation but lack scalable retrieval capabilities. We propose $\textbf{VLM2GeoVec}$, an instruction-following, single-encoder vision-language model trained contrastively to embed interleaved inputs (images, text, bounding boxes, and geographic coordinates) in a unified vector space. A single encoder maps each interleaved input sequence to one joint embedding trained with a contrastive loss, eliminating multi-stage pipelines and task-specific modules. To evaluate its versatility, we introduce $\textbf{RSMEB}$, a novel benchmark covering key remote-sensing embedding applications: scene classification; cross-modal search; compositional retrieval; visual question answering; visual grounding and region-level reasoning; and semantic geospatial retrieval. On RSMEB, VLM2GeoVec achieves $\textbf{26.6\%}$ P@1 on region-caption retrieval (+25 pp vs. dual-encoder baselines), $\textbf{32.5\%}$ P@1 on referring-expression retrieval (+19 pp), and $\textbf{17.8\%}$ P@1 on semantic geo-localization retrieval (over $3\times$ the prior best), while matching or exceeding specialized baselines on conventional tasks such as scene classification and cross-modal retrieval. By unifying scalable retrieval with region-level spatial reasoning, VLM2GeoVec enables cohesive multimodal analysis in remote sensing. We will publicly release the code, checkpoints, and data upon acceptance.
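The abstract describes the training objective only at a high level: a contrastive loss over paired interleaved inputs embedded by a single encoder. As a point of reference, the sketch below shows a generic symmetric InfoNCE loss of the kind typically used to train such single-encoder embedding models. It is an illustrative assumption, not the released VLM2GeoVec code; the encoder that would map interleaved images, text, boxes, and coordinates to `query_emb`/`target_emb` is omitted, and the function name and temperature value are hypothetical.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  target_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    query_emb, target_emb: (B, D) outputs of the same encoder for the
    two sides of each pair; the other B-1 rows act as in-batch negatives.
    """
    q = F.normalize(query_emb, dim=-1)           # unit-normalize queries
    t = F.normalize(target_emb, dim=-1)          # unit-normalize targets
    logits = q @ t.T / temperature               # (B, B) scaled cosine similarities
    labels = torch.arange(q.size(0), device=q.device)  # matched pairs lie on the diagonal
    # average the query->target and target->query cross-entropies
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))

# Example: stand-in embeddings for 8 (interleaved query, target) pairs of dim 512
if __name__ == "__main__":
    q = torch.randn(8, 512)
    t = torch.randn(8, 512)
    print(info_nce_loss(q, t))
```

Under this kind of objective, one encoder embeds both sides of every pair, so at inference time every RSMEB task reduces to nearest-neighbor search in the shared vector space.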

Emanuel Sánchez Aimar, Gulnaz Zhambulova, Fahad Shahbaz Khan, Yonghao Xu, Michael Felsberg • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Scene Classification | AID | Top-1 Accuracy | 77.25 | 47 |
| Scene Classification | UCM | Top-1 Accuracy | 90.24 | 28 |
| Image-Text Retrieval | RSICD | -- | -- | 26 |
| Image-to-Text Retrieval | RSITMD | Rank | 2 | 19 |
| Text-to-Image Retrieval | RSITMD | mR | 2 | 19 |
| Text-to-Image Retrieval | UCM caption | R@1/5/10 | 52.76 | 11 |
| Scene Classification | PatternNet | Accuracy | 79.76 | 7 |
| Visual Question Answering | HRBEN (test) | Presence Precision@1 | 69.47 | 7 |
| Visual Question Answering | LRBEN (test) | Presence Precision@1 | 89.78 | 7 |
| Scene Classification | Million-AID | Accuracy | 64.82 | 7 |

Showing 10 of 19 rows.
