
VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing

About

Satellite imagery differs fundamentally from natural images: its aerial viewpoint, very high resolution, diverse scale variations, and abundance of small objects demand both region-level spatial reasoning and holistic scene understanding. Current remote-sensing approaches remain fragmented between dual-encoder retrieval models, which excel at large-scale cross-modal search but cannot interleave modalities, and generative assistants, which support region-level interpretation but lack scalable retrieval capabilities. We propose $\textbf{VLM2GeoVec}$, an instruction-following, single-encoder vision-language model trained contrastively to embed interleaved inputs (images, text, bounding boxes, and geographic coordinates) in a unified vector space. A single encoder maps each interleaved input sequence to one joint embedding trained with a contrastive loss, eliminating multi-stage pipelines and task-specific modules. To evaluate its versatility, we introduce $\textbf{RSMEB}$, a novel benchmark covering key remote-sensing embedding applications: scene classification; cross-modal search; compositional retrieval; visual question answering; visual grounding and region-level reasoning; and semantic geospatial retrieval. On RSMEB, VLM2GeoVec achieves $\textbf{26.6\%}$ P@1 on region-caption retrieval (+25 pp vs. dual-encoder baselines), $\textbf{32.5\%}$ P@1 on referring-expression retrieval (+19 pp), and $\textbf{17.8\%}$ P@1 on semantic geo-localization retrieval (over $3\times$ the prior best), while matching or exceeding specialized baselines on conventional tasks such as scene classification and cross-modal retrieval. By unifying scalable retrieval with region-level spatial reasoning, VLM2GeoVec enables cohesive multimodal analysis in remote sensing. We will publicly release the code, checkpoints, and data upon acceptance.
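The abstract describes the training objective only at a high level: a contrastive loss over paired interleaved inputs embedded by a single encoder. As a point of reference, the sketch below shows a generic symmetric InfoNCE loss of the kind typically used to train such single-encoder embedding models. It is an illustrative assumption, not the released VLM2GeoVec code; the encoder that would map interleaved images, text, boxes, and coordinates to `query_emb`/`target_emb` is omitted, and the function name and temperature value are hypothetical.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  target_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    query_emb, target_emb: (B, D) outputs of the same encoder for the
    two sides of each pair; the other B-1 rows act as in-batch negatives.
    """
    q = F.normalize(query_emb, dim=-1)           # unit-normalize queries
    t = F.normalize(target_emb, dim=-1)          # unit-normalize targets
    logits = q @ t.T / temperature               # (B, B) scaled cosine similarities
    labels = torch.arange(q.size(0), device=q.device)  # matched pairs lie on the diagonal
    # average the query->target and target->query cross-entropies
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))

# Example: stand-in embeddings for 8 (interleaved query, target) pairs of dim 512
if __name__ == "__main__":
    q = torch.randn(8, 512)
    t = torch.randn(8, 512)
    print(info_nce_loss(q, t))
```

Under this kind of objective, one encoder embeds both sides of every pair, so at inference time every RSMEB task reduces to nearest-neighbor search in the shared vector space.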

Emanuel Sánchez Aimar, Gulnaz Zhambulova, Fahad Shahbaz Khan, Yonghao Xu, Michael Felsberg • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Scene Classification | AID | Top-1 Accuracy | 77.25 | 47 |
| Scene Classification | UCM | Top-1 Accuracy | 90.24 | 28 |
| Image-Text Retrieval | RSICD | -- | -- | 26 |
| Image-to-Text Retrieval | RSITMD | Rank | 2 | 19 |
| Text-to-Image Retrieval | RSITMD | mR | 2 | 19 |
| Text-to-Image Retrieval | UCM caption | R@1/5/10 | 52.76 | 11 |
| Scene Classification | PatternNet | Accuracy | 79.76 | 7 |
| Visual Question Answering | HRBEN (test) | Presence Precision@1 | 69.47 | 7 |
| Visual Question Answering | LRBEN (test) | Presence Precision@1 | 89.78 | 7 |
| Scene Classification | Million-AID | Accuracy | 64.82 | 7 |

Showing 10 of 19 rows.
