
GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model

About

This work tackles the problem of geo-localization with a new paradigm: a large vision-language model (LVLM) augmented with human inference knowledge. A primary challenge is the scarcity of data for training the LVLM: existing street-view datasets contain many low-quality images lacking visual clues, and none provide reasoning inference. To address the data-quality issue, we devise a CLIP-based network to quantify the degree to which street-view images are locatable, leading to a new dataset of highly locatable street views. To enhance reasoning inference, we integrate external knowledge from real geo-localization games, tapping into valuable human inference capabilities. These data are used to train GeoReasoner, which is fine-tuned through dedicated reasoning and location-tuning stages. Qualitative and quantitative evaluations show that GeoReasoner outperforms counterpart LVLMs by more than 25% on country-level and 38% on city-level geo-localization tasks, and surpasses StreetCLIP while requiring fewer training resources. The data and code are available at https://github.com/lingli1996/GeoReasoner.
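The CLIP-based locatability network described above can be illustrated with a zero-shot-style scoring sketch. This is a minimal illustration, not the authors' implementation: the embedding vectors are stand-ins (the real pipeline would come from CLIP image and text encoders), and the prompt framing of "rich visual clues" vs. "no visual clues" is an assumption about how such a scorer could be posed.

```python
import numpy as np

def locatability_score(image_emb, pos_text_emb, neg_text_emb):
    """Score how 'locatable' an image is, CLIP zero-shot style:
    compare the image embedding against a positive text embedding
    (e.g. 'a street view with clear location cues') and a negative
    one (e.g. 'a street view with no location cues'), then softmax
    the two cosine similarities into a probability in [0, 1]."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    pos_sim = cos(image_emb, pos_text_emb)
    neg_sim = cos(image_emb, neg_text_emb)
    # Two-way softmax: higher means more locatable.
    e_pos, e_neg = np.exp(pos_sim), np.exp(neg_sim)
    return e_pos / (e_pos + e_neg)
```

Thresholding such a score is one plausible way to filter a raw street-view corpus down to the "highly locatable" subset the abstract mentions.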

Ling Li, Yu Ye, Yao Zhou, Bingchuan Jiang, Wei Zeng • 2024

Related benchmarks

| Task                  | Dataset         | Metric                       | Result | Rank |
|-----------------------|-----------------|------------------------------|--------|------|
| Image Geolocalization | IM2GPS3K (test) | Success Rate (25 km)         | 26.94  | 93   |
| Image Geolocalization | IM2GPS          | Success Rate @ 1 km (Street) | 13     | 14   |
| Visual Geolocation    | Im2GPS3k        | Success Rate @ 1 km          | 10     | 10   |
| Geolocation           | GeoSeek (val)   | Success Rate (City, 25 km)   | 13.55  | 9    |
| Image Geolocation     | CCL-Bench       | City ACC                     | 18.33  | 8    |
| Image Geolocation     | CCL-Bench       | Accuracy @ 1 km              | 0.33   | 8    |
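The benchmark results above are distance-thresholded success rates: a prediction counts as correct if its predicted coordinates fall within a given radius (e.g. 1 km or 25 km) of the ground truth. A minimal sketch of that metric, using the standard haversine great-circle distance (the exact distance function used by each leaderboard is an assumption here):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    R = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(a))

def success_rate(preds, truths, threshold_km):
    """Percentage of predictions within threshold_km of the truth.
    preds and truths are parallel lists of (lat, lon) tuples."""
    hits = sum(
        haversine_km(p[0], p[1], t[0], t[1]) <= threshold_km
        for p, t in zip(preds, truths)
    )
    return 100.0 * hits / len(preds)
```

For example, a prediction in central Paris scored against a ground truth elsewhere in Paris passes the 25 km threshold, while one scored against London does not.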
