Quality-Driven Curation of Remote Sensing Vision-Language Data via Learned Scoring Models

About

Vision-Language Models (VLMs) have demonstrated great potential in interpreting remote sensing (RS) images through language-guided semantic. However, the effectiveness of these VLMs critically depends on high-quality image-text training data that captures rich semantic relationships between visual content and language descriptions. Unlike natural images, RS lacks large-scale interleaved image-text pairs from web data, making data collection challenging. While current approaches rely primarily on rule-based methods or flagship VLMs for data synthesis, a systematic framework for automated quality assessment of such synthetically generated RS vision-language data is notably absent. To fill this gap, we propose a novel score model trained on large-scale RS vision-language preference data for automated quality assessment. Our empirical results demonstrate that fine-tuning CLIP or advanced VLMs (e.g., Qwen2-VL) with the top 30% of data ranked by our score model achieves superior accuracy compared to both full-data fine-tuning and CLIP-score-based ranking approaches. Furthermore, we demonstrate applications of our scoring model for reinforcement learning (RL) training and best-of-N (BoN) test-time scaling, enabling significant improvements in VLM performance for RS tasks. Our code, model, and dataset are publicly available

Dilxat Muhtar, Enzhuo Zhang, Zhenshi Li, Feng Gu, Yanglangxing He, Pengfeng Xiao, Xueliang Zhang• 2025

Related benchmarks

Task	Dataset	Result	Rank
Scene Classification	AID	Top-1 Acc85.9		69
Visual Grounding	DIOR-RSVG	Accuracy@0.564.52		34

Showing 2 of 2 rows

Other info

Follow for update

@wizwand_team Discord