# SATGround: A Spatially-Aware Approach for Visual Grounding in Remote Sensing

## About
Vision-language models (VLMs) are emerging as powerful generalist tools for remote sensing, capable of integrating information across diverse tasks and enabling flexible, instruction-based interactions via a chat interface. In this work, we enhance VLM-based visual grounding in satellite imagery by proposing a novel structured localization mechanism. Our approach involves finetuning a pretrained VLM on a diverse set of instruction-following tasks, while interfacing a dedicated grounding module through specialized control tokens for localization. This method facilitates joint reasoning over both language and spatial information, significantly enhancing the model's ability to precisely localize objects in complex satellite scenes. We evaluate our framework on several remote sensing benchmarks, consistently improving the state-of-the-art, including a 24.8% relative improvement over previous methods on visual grounding. Our results highlight the benefits of integrating structured spatial reasoning into VLMs, paving the way for more reliable real-world satellite data analysis.
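As a rough illustration of the control-token interface described above, the sketch below parses a model response in which grounded phrases are wrapped in special tokens followed by quantized location tokens. All token names (`<ground>`, `<loc_N>`), the number of quantization bins, and the helper itself are hypothetical assumptions for illustration; the abstract does not specify SATGround's actual token vocabulary.

```python
import re

# Hypothetical control tokens -- the actual tokens used by SATGround
# are not specified in the abstract above.
GROUND_OPEN, GROUND_CLOSE = "<ground>", "</ground>"
LOC_PATTERN = re.compile(r"<loc_(\d+)>")
NUM_BINS = 1000  # assumed coordinate quantization granularity

def parse_grounded_response(text, img_w, img_h):
    """Extract (phrase, bbox) pairs from a response that marks grounded
    phrases with control tokens followed by four quantized location
    tokens, e.g. '<ground>red car</ground><loc_12><loc_40><loc_88><loc_95>'."""
    results = []
    pattern = (re.escape(GROUND_OPEN) + r"(.*?)" + re.escape(GROUND_CLOSE)
               + r"((?:<loc_\d+>){4})")
    for m in re.finditer(pattern, text):
        phrase = m.group(1)
        bins = [int(b) for b in LOC_PATTERN.findall(m.group(2))]
        # De-quantize: each bin is a fraction of the image extent.
        x0 = bins[0] / NUM_BINS * img_w
        y0 = bins[1] / NUM_BINS * img_h
        x1 = bins[2] / NUM_BINS * img_w
        y1 = bins[3] / NUM_BINS * img_h
        results.append((phrase, (x0, y0, x1, y1)))
    return results

resp = ("There is a <ground>swimming pool</ground>"
        "<loc_100><loc_200><loc_300><loc_400> near the road.")
print(parse_grounded_response(resp, 1000, 1000))
```

In a real system the grounding module would regress continuous box coordinates rather than decode quantized tokens numerically, but the parsing pattern is the same: the control tokens mark where spatial output begins and ends in the language stream.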
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Referred Object Detection | NWPU VHR-10 (test) | AP (Small) | 14.6 | 9 |
| Referred Object Detection | Swimming Pool Dataset (test) | Medium Object Metric | 8.4 | 9 |
| Referring | GeoChat | Referring Accuracy (Average) | 31.3 | 9 |
| Visual Grounding | GeoChat | Accuracy (Small) | 11.5 | 9 |
| Visual Question Answering | LRBEN v1.0 (test) | Presence | 91.7 | 9 |
| Referred Object Detection | Urban Tree Crown Detection (test) | AP (Medium) | 4.9 | 9 |
| Visual Question Answering | HRBEN | Presence Precision@1 | 63 | 7 |