Locatability-Guided Adaptive Reasoning for Image Geo-Localization with Vision-Language Models

About

The emergence of Vision-Language Models (VLMs) has introduced new paradigms for global image geo-localization through retrieval-augmented generation (RAG) and reasoning-driven inference. However, RAG methods are constrained by retrieval database quality, while reasoning-driven approaches fail to internalize image locatability, relying on inefficient, fixed-depth reasoning paths that increase hallucinations and degrade accuracy. To overcome these limitations, we introduce an Optimized Locatability Score that quantifies an image's suitability for deep reasoning in geo-localization. Using this metric, we curate Geo-ADAPT-51K, a locatability-stratified reasoning dataset enriched with augmented reasoning trajectories for complex visual scenes. Building on this foundation, we propose a two-stage Group Relative Policy Optimization (GRPO) curriculum with customized reward functions that regulate adaptive reasoning depth, visual grounding, and hierarchical geographical accuracy. Our framework, Geo-ADAPT, learns an adaptive reasoning policy, achieves state-of-the-art performance across multiple geo-localization benchmarks, and substantially reduces hallucinations by reasoning both adaptively and efficiently.
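The paper's exact reward definitions are not given in this abstract, so the following is a purely illustrative sketch of what a "hierarchical geographical accuracy" reward with an adaptive-depth penalty might look like. All names (`hierarchical_geo_reward`, `depth_penalty`), the weight values, and the token-budget scaling are assumptions, not the authors' method: coarse-to-fine partial credit is gated so a finer level only scores if all coarser levels match, and reasoning length is penalized more heavily on highly locatable (easy) images.

```python
def hierarchical_geo_reward(pred, truth, weights=(0.2, 0.3, 0.5)):
    """Coarse-to-fine reward: partial credit for continent, country, city.

    Credit at a level is granted only if all coarser levels also match,
    so a correct city name under the wrong country earns nothing.
    (Illustrative sketch; weights are hypothetical.)
    """
    levels = ("continent", "country", "city")
    reward, prefix_ok = 0.0, True
    for w, level in zip(weights, levels):
        prefix_ok = prefix_ok and (pred[level] == truth[level])
        if prefix_ok:
            reward += w
    return reward


def depth_penalty(n_reasoning_tokens, locatability, budget=512, alpha=0.1):
    """Penalize reasoning beyond a budget that shrinks as locatability rises.

    A highly locatable image (locatability near 1.0) gets almost no
    reasoning budget; a hard image (near 0.0) gets the full budget.
    (Illustrative sketch; the scaling rule is an assumption.)
    """
    allowed = budget * (1.0 - locatability)
    overshoot = max(0.0, n_reasoning_tokens - allowed)
    return -alpha * overshoot / budget
```

A GRPO-style trainer could sum these two terms per sampled trajectory before computing group-relative advantages; the point of the sketch is only the prefix-gated credit and the locatability-scaled length penalty.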

Bo Yu, Fengze Yang, Yiming Liu, Chao Wang, Xuewen Luo, Taozhe Li, Ruimin Ke, Xiaofan Zhou, Chenxi Liu • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Image Geolocalization | YFCC4k | Success Rate @ 1 km | 32.5 | 30
Image Geolocalization | Im2GPS3k | Success Rate @ 1 km | 17.9 | 26
City and country name prediction | Geo-ADAPT-51K (test) | City Name Accuracy | 55.8 | 7
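The "Success Rate @ 1 km" metric in the benchmarks above counts a prediction as correct when its coordinates fall within 1 km great-circle distance of the ground truth. A minimal sketch of that computation using the standard haversine formula (function names here are illustrative, not from the paper):

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    R = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(a))


def success_rate_at_km(preds, truths, threshold_km=1.0):
    """Fraction of predicted (lat, lon) points within threshold_km of truth."""
    hits = sum(
        haversine_km(p[0], p[1], t[0], t[1]) <= threshold_km
        for p, t in zip(preds, truths)
    )
    return hits / len(preds)
```

The same function with thresholds of 25 km, 200 km, etc. yields the coarser levels commonly reported on Im2GPS3k and YFCC4k.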
