OSMDA: OpenStreetMap-based Domain Adaptation for Remote Sensing VLMs

About

Vision-Language Models (VLMs) adapted to remote sensing rely heavily on domain-specific image-text supervision, yet high-quality annotations for satellite and aerial imagery remain scarce and expensive to produce. Prevailing pseudo-labeling pipelines address this gap by distilling knowledge from large frontier models, but this dependence on large teachers is costly, limits scalability, and caps achievable performance at the ceiling of the teacher. We propose OSMDA, a self-contained domain adaptation framework that eliminates this dependency. Our key insight is that a capable base VLM can serve as its own annotation engine: by pairing aerial images with rendered OpenStreetMap (OSM) tiles, we leverage the model's optical character recognition and chart comprehension capabilities to generate captions enriched by OSM's vast auxiliary metadata. The model is then fine-tuned on the resulting corpus with satellite imagery alone, yielding OSMDA-VLM, a domain-adapted VLM that requires no manual labeling and no stronger external model. We conduct exhaustive evaluations spanning 10 benchmarks across image-text-to-text tasks and compare against 9 competitive baselines. When equally mixed with real data, our method achieves state-of-the-art results while being substantially cheaper to train than teacher-dependent alternatives. These results suggest that, given a strong foundation model, alignment with crowd-sourced geographic data is a practical and scalable path towards remote sensing domain adaptation. Dataset and model weights will be made publicly available.
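
The pairing of an aerial image with an OSM tile can be illustrated with the standard Web Mercator ("slippy map") tile arithmetic. The sketch below is not taken from the paper: the function names, the record layout, and the choice of a single tile per image center are assumptions made for illustration only, and the authors' actual rendering and alignment procedure may differ.

```python
import math

# Web Mercator tiles are only defined up to this latitude.
MAX_LAT = 85.05112878


def latlon_to_tile(lat_deg, lon_deg, zoom):
    """Return the (x, y) index of the slippy-map tile containing a point."""
    n = 2 ** zoom
    lat = max(min(lat_deg, MAX_LAT), -MAX_LAT)
    lat_rad = math.radians(lat)
    x = math.floor((lon_deg + 180.0) / 360.0 * n)
    y = math.floor((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the eastern or southern edge stay in range.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tile_bounds(x, y, zoom):
    """Return (west, south, east, north) of a tile in degrees."""
    n = 2 ** zoom

    def lat_of(row):
        return math.degrees(math.atan(math.sinh(math.pi * (1.0 - 2.0 * row / n))))

    west = x / n * 360.0 - 180.0
    east = (x + 1) / n * 360.0 - 180.0
    return west, lat_of(y + 1), east, lat_of(y)


def pair_record(image_path, lat_deg, lon_deg, zoom):
    """Hypothetical record linking an aerial image to its covering OSM tile."""
    x, y = latlon_to_tile(lat_deg, lon_deg, zoom)
    return {
        "image": image_path,
        "tile": (zoom, x, y),
        "bounds": tile_bounds(x, y, zoom),
    }
```

A record built this way identifies which map tile to render next to a given image; in practice an image footprint may span several tiles, which this sketch does not handle.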

Stefan Maria Ailuro, Mario Markov, Mohammad Mahdi, Delyan Boychev, Luc Van Gool, Danda Pani Paudel (2026)

Related benchmarks

Task	Dataset	Metric	Result
Visual Question Answering	RSVQA-HR	Average Score	72.5	38
Remote Sensing Scene Classification	EuroSAT	--	--	15
Visual Question Answering	RSVQA LR	Aggregated Score	80.6	14
Image Captioning	NWPU-Captions	GEval	0.395	10
Image Captioning	UCM Captions	GEval Score	50	10
Image Captioning	VRSBench caption	GEval Score	0.429	10
Image Captioning	XLRS-Bench caption	GEval Score	40.4	10
Remote Sensing Scene Classification	Million-AID	F1 Score	50.4	10
Visual Question Answering	VRSBench vqa	F1 Score	74.4	10
Visual Question Answering	XLRS-Bench vqa	F1 Score	21.6	10

Showing 10 of 12 rows
