GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model

About

This work tackles the problem of geo-localization with a new paradigm using a large vision-language model (LVLM) augmented with human inference knowledge. A primary challenge is the scarcity of data for training the LVLM: existing street-view datasets often contain numerous low-quality images that lack visual clues, and they lack any reasoning inference. To address the data-quality issue, we devise a CLIP-based network to quantify how locatable a street-view image is, leading to the creation of a new dataset comprising highly locatable street views. To enhance reasoning inference, we integrate external knowledge obtained from real geo-localization games, tapping into valuable human inference capabilities. The data are used to train GeoReasoner, which is fine-tuned through dedicated reasoning and location-tuning stages. Qualitative and quantitative evaluations show that GeoReasoner outperforms counterpart LVLMs by more than 25% on country-level and 38% on city-level geo-localization tasks, and surpasses StreetCLIP while requiring fewer training resources. The data and code are available at https://github.com/lingli1996/GeoReasoner.

Ling Li, Yu Ye, Yao Zhou, Bingchuan Jiang, Wei Zeng • 2024

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Image Geolocalization	IM2GPS3K (test)	Success Rate (25km)	33.4	159
Image Geolocalization	YFCC4k	Success Rate (1km)	2	46
Image Geolocalization	IM2GPS	Success Rate @ 25 km (City)	44	26
Visual Geolocalization	CityGuessr68k	City Accuracy	38.5	15
Image Geolocalization	YFCC26k	Success Rate @ 1 km (Street)	4	14
Visual Geolocation	Im2GPS3k	Success Rate @ 1km	10	10
Geolocation	GeoSeek (val)	Success Rate (City 25km)	13.55	9
Image Geolocalization	MP16-Reason	Street 1km Success Rate	10.06	9
Image Geolocation	CCL-Bench	City ACC	18.33	8
Image Geolocation	CCL-Bench	Accuracy @ 1km	0.33	8

Showing 10 of 16 rows
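
The distance-thresholded metrics in the table (for example, Success Rate at 1 km or 25 km) are conventionally read as the percentage of images whose predicted coordinates fall within the given great-circle distance of the ground truth. The sketch below illustrates that reading; it is not taken from the GeoReasoner code. The function names `haversine_km` and `success_rate`, the use of the haversine formula with a mean Earth radius, and the sample coordinates are all assumptions made for illustration, and individual benchmarks may define the metric differently.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed constant for this sketch)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def success_rate(predictions, ground_truth, threshold_km):
    """Percentage of predictions within threshold_km of the matching ground truth.

    Both arguments are equal-length sequences of (lat, lon) tuples in degrees.
    """
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")
    hits = sum(
        haversine_km(p[0], p[1], g[0], g[1]) <= threshold_km
        for p, g in zip(predictions, ground_truth)
    )
    return 100.0 * hits / len(predictions)


# Hypothetical data: three predictions scored against the same true location.
truth = [(0.0, 0.0)] * 3
preds = [(0.0, 0.0), (0.0, 1.0), (0.0, 0.1)]
city_level = success_rate(preds, truth, threshold_km=25)
street_level = success_rate(preds, truth, threshold_km=1)
```

Under this reading, tightening the threshold from 25 km (city level) to 1 km (street level) can only keep the score the same or lower it, which matches the pattern of lower 1 km figures in the table.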
