Where am I? Cross-View Geo-localization with Natural Language Descriptions
About
Cross-view geo-localization identifies the location of a street-view image by matching it against geo-tagged satellite images or OpenStreetMap (OSM) data. Most existing studies focus on image-to-image retrieval; fewer address text-guided retrieval, a task vital for applications such as pedestrian navigation and emergency response. In this work, we introduce a novel task, cross-view geo-localization with natural language descriptions, which aims to retrieve the corresponding satellite images or OSM data from scene text descriptions. To support this task, we construct the CVG-Text dataset by collecting cross-view data from multiple cities and applying a scene text generation approach that leverages the annotation capabilities of Large Multimodal Models to produce high-quality scene text descriptions with localization details. We also propose a novel text-based retrieval localization method, CrossText2Loc, which improves recall by 10% and demonstrates excellent long-text retrieval capabilities. For explainability, it provides not only similarity scores but also retrieval reasons. More information can be found at https://yejy53.github.io/CVG-Text/.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Cross-modal Geo-localization | CVG-Text (New York) | R@1 | 59.08 | 29 |
| Cross-modal Geo-localization | CVG-Text (Brisbane) | Recall@1 | 47.58 | 15 |
| Cross-modal Geo-localization | CVG-Text Tokyo | Recall@1 | 41.75 | 15 |
| Cross-modal Geo-localization | CORE World-level 1.0 (All) | R@1 | 51.92 | 15 |
| Cross-modal Geo-localization | CORE Intercontinental-level Subset1 1.0 | R@1 | 53.12 | 15 |
| Cross-modal Geo-localization | CORE Intercontinental-level Subset3 1.0 | R@1 | 46.97 | 15 |
| Cross-modal Geo-localization | CORE Intercontinental-level Subset4 1.0 | R@1 | 48.71 | 15 |
| Cross-modal Geo-localization | CORE Intercontinental-level Subset2 1.0 | R@1 | 59.36 | 15 |
| Text-to-Satellite Image Retrieval | CVG-Text (Brisbane) | R@1 | 46.08 | 14 |
| Text-to-Satellite Image Retrieval | CVG-Text Tokyo | R@1 | 36.83 | 14 |
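
The R@1 (Recall@1) figures in the table are retrieval metrics: the percentage of text queries for which the correct satellite image or OSM tile is ranked first. The sketch below shows one common way such a metric can be computed from a text-to-image similarity matrix. It is an illustration only, not the evaluation code of the paper. The function name `recall_at_k`, the toy similarity values, and the assumption that query `i` has exactly one ground-truth match at gallery index `i` are all hypothetical.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Percentage of queries whose ground-truth item is ranked in the top k.

    Assumption (hypothetical setup): similarity[i][j] is the score between
    text query i and gallery image j, and the ground truth for query i is
    gallery image i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for query_idx, scores in enumerate(similarity):
        # Rank gallery indices by descending similarity to this query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if query_idx in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example: 3 text queries scored against 3 gallery images.
sim = [
    [0.9, 0.1, 0.3],  # query 0: its own image (index 0) ranks first
    [0.2, 0.4, 0.8],  # query 1: its own image (index 1) ranks second
    [0.1, 0.2, 0.7],  # query 2: its own image (index 2) ranks first
]
print(recall_at_k(sim, k=1))  # two of three queries are hits at rank 1
print(recall_at_k(sim, k=2))  # all three queries are hits within the top 2
```

Under this convention, a reported R@1 of 59.08 would mean that roughly 59 of every 100 text descriptions retrieve their matching image at rank one.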