Unlocking Zero-Shot Geospatial Reasoning via Indirect Rewards
About
Training robust reasoning vision-language models (VLMs) in rare domains such as geospatial analysis is fundamentally constrained by supervision scarcity. While raw geospatial imagery is abundant, task-direct supervision lags far behind what is available in common domains. In this work, we show that indirect verifiable rewards, derived from seemingly unrelated metadata, are sufficient to induce sophisticated and generalizable geospatial reasoning across a wide range of downstream tasks (25+). We present Geo-R1 as one empirical instantiation of this paradigm. Rather than relying on limited task-specific annotations (i.e., direct rewards), Geo-R1 uses scalable, verifiable proxy rewards based on cross-view alignment with geolocation metadata to drive reinforcement learning at scale. These indirect rewards lead the model to discover and internalize geospatial reasoning that transfers zero-shot across diverse tasks, including out-of-distribution benchmarks, and that even surpasses fully supervised specialists on certain benchmarks. These findings indicate that optimizing for indirect verifiable rewards may provide a scalable pathway to unlock generalized reasoning capabilities in rare domains with massive unlabeled data archives. Our code is available at: https://github.com/miniHuiHui/Geo-R1.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Geolocalization | IM2GPS3K (test) | Success Rate (25km) | 41.3 | 159 |
| Visual Grounding | DIOR-RSVG | Accuracy@0.5 | 17.67 | 34 |
| Referring Expression Comprehension | VRSBench (test) | Accuracy@0.5 | 49.6 | 16 |
| Open-Vocabulary Detection | NWPU VHR-10 (val) | mAP (IoU=0.5:0.95) | 18.87 | 13 |
| Geographic Localization | IMAGEO-GSS (test) | City Accuracy | 32.7264 | 10 |
| Visual Question Answering | VRSBench | Avg@5 | 57 | 10 |
| Visual Grounding | VRSBench Ref | IoU@50 | 17.18 | 10 |
| Visual Question Answering | RSFG-SC | Scene Accuracy | 52.46 | 10 |
| Visual Question Answering | RSFG-VQA | Avg@5 | 0.4503 | 10 |
| Visual Question Answering | RSVQA | Avg@5 | 34.5 | 10 |
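
Two of the metrics in the table are threshold-based, and the sketch below illustrates how such metrics are commonly computed: a geolocalization success rate counts predictions within a distance threshold of the true coordinates, and Accuracy@0.5 counts predicted boxes whose IoU with the ground truth reaches 0.5. This is a minimal illustration under stated assumptions, not the evaluation code used by these benchmarks or by Geo-R1. The function names, the (latitude, longitude) tuple format, the corner-format boxes, and the use of great-circle (haversine) distance are all assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def success_rate(preds, targets, threshold_km=25.0):
    """Percentage of predicted (lat, lon) pairs within threshold_km of the target."""
    hits = sum(haversine_km(*p, *t) <= threshold_km for p, t in zip(preds, targets))
    return 100.0 * hits / len(targets)


def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds, targets, threshold=0.5):
    """Percentage of predicted boxes whose IoU with the target is >= threshold."""
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(targets)


if __name__ == "__main__":
    paris, versailles, london = (48.8566, 2.3522), (48.8049, 2.1204), (51.5074, -0.1278)
    # One prediction lands near the target, the other is hundreds of km away.
    print(success_rate([paris, paris], [versailles, london]))
    # One box matches exactly, the other overlaps by only a third.
    print(accuracy_at_iou([(0, 0, 2, 2), (0, 0, 2, 2)], [(0, 0, 2, 2), (1, 0, 3, 2)]))
```

Both helpers return percentages to match the scale of most entries in the table. Individual benchmarks may define distance, box format, or averaging differently, so reported numbers should be reproduced with each benchmark's own tooling.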