Unlocking Zero-Shot Geospatial Reasoning via Indirect Rewards

About

Training robust reasoning vision-language models (VLMs) in rare domains (such as geospatial) is fundamentally constrained by supervision scarcity. While raw geospatial imagery is abundant, the amount of task-direct supervision falls far behind that of common domains. In this work, we validate an important conclusion: indirect verifiable rewards, derived from seemingly unrelated metadata, are sufficient to induce sophisticated and generalizable geospatial reasoning across a wide range of downstream tasks (25+). We present Geo-R1 as one empirical instantiation of this paradigm. Rather than relying on limited task-specific annotations (i.e., direct rewards), Geo-R1 utilizes scalable, verifiable indirect proxy rewards based on cross-view alignment with metadata (geolocation information) to drive reinforcement learning at scale. Such indirect rewards successfully motivate the model to discover and internalize zero-shot geospatial reasoning across diverse tasks, achieving extraordinary zero-shot transfer on out-of-distribution benchmarks and even surpassing fully supervised specialists on certain benchmarks. These findings indicate that optimizing for indirect verifiable rewards may provide a scalable pathway to unlock generalized reasoning capabilities in rare domains with massive unlabeled data archives. Our code is available at: https://github.com/miniHuiHui/Geo-R1.
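To make the idea of an indirect verifiable reward concrete, here is a minimal sketch, not the authors' implementation. It assumes a cross-view matching setup in which the model is shown a ground-level image and several candidate overhead images, and must name the candidate that shares the ground image's geolocation metadata. The `<answer>` tag format, the function names, and the binary reward values are all hypothetical choices made for illustration.

```python
import re
from typing import List, Optional


def parse_choice(response: str) -> Optional[int]:
    """Extract a candidate index from a response such as '... <answer>2</answer>'.

    The <answer> tag convention is an assumption of this sketch.
    """
    match = re.search(r"<answer>\s*(\d+)\s*</answer>", response)
    return int(match.group(1)) if match else None


def indirect_reward(response: str, candidate_ids: List[str], true_id: str) -> float:
    """Return 1.0 if the chosen candidate is the one co-located with the query.

    `true_id` comes from geolocation metadata, so no task-specific human
    annotation is needed to verify the answer.
    """
    choice = parse_choice(response)
    if choice is None or not (0 <= choice < len(candidate_ids)):
        return 0.0  # unparseable or out-of-range answers earn no reward
    return 1.0 if candidate_ids[choice] == true_id else 0.0
```

The point of the sketch is that correctness is checked mechanically against metadata, which is what makes such a reward usable for reinforcement learning on large unlabeled archives.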

Chenhui Xu, Fuxun Yu, Michael J. Bianco, Jacob Kovarskiy, Raphael Tang, Qi Zhang, Zirui Xu, Will LeVine, Brandon Dubbs, Heming Liao, Cassandra Burgess, Suvam Bag, Jay Patravali, Rupanjali Kukal, Mikael Figueroa, Rishi Madhok, Nikolaos Karianakis, Jinjun Xiong • 2025

Related benchmarks

Task	Dataset	Metric	Result
Image Geolocalization	IM2GPS3K (test)	Success Rate (25km)	41.3	159
Visual Grounding	DIOR-RSVG	Accuracy@0.5	17.67	34
Referring Expression Comprehension	VRSBench (test)	Accuracy@0.5	49.6	16
Open-Vocabulary Detection	NWPU VHR-10 (val)	mAP (IoU=0.5:0.95)	18.87	13
Geographic Localization	IMAGEO-GSS (test)	City Accuracy	32.7264	10
Visual Question Answering	VRSBench	Avg@5	57	10
Visual Grounding	VRSBench Ref	IoU@50	17.18	10
Visual Question Answering	RSFG-SC	Scene Accuracy	52.46	10
Visual Question Answering	RSFG-VQA	Avg@5	0.4503	10
Visual Question Answering	RSVQA	Avg@5	34.5	10

Showing 10 of 17 rows
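
For readers unfamiliar with the metrics in the table, the sketch below shows how two of them are commonly computed: a success rate at a distance threshold (as in "Success Rate (25km)") and accuracy at an IoU threshold (as in "Accuracy@0.5"). It assumes great-circle (haversine) distance on a spherical Earth, axis-aligned boxes in `(x1, y1, x2, y2)` form, and results reported as percentages. The function names are hypothetical, and the exact evaluation protocol of each benchmark may differ.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def success_rate_at_km(preds, targets, threshold_km=25.0):
    """Percentage of predicted (lat, lon) pairs within threshold_km of the target."""
    hits = sum(haversine_km(*p, *t) <= threshold_km for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)


def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds, targets, threshold=0.5):
    """Percentage of predicted boxes whose IoU with the target meets the threshold."""
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)
```

Under these definitions, a prediction counts as a success only if it clears the threshold, so both metrics are simple hit rates over the test set.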
