Rethinking Visual Geo-localization for Large-Scale Applications
About
Visual Geo-localization (VG) is the task of estimating the position where a given photo was taken by comparing it with a large database of images of known locations. To investigate how existing techniques would perform on a real-world city-wide VG application, we build San Francisco eXtra Large, a new dataset covering a whole city and providing a wide range of challenging cases, with a size 30x bigger than the previous largest dataset for visual geo-localization. We find that current methods fail to scale to such large datasets; we therefore design a new, highly scalable training technique called CosPlace, which casts training as a classification problem, avoiding the expensive mining needed by the commonly used contrastive learning. We achieve state-of-the-art performance on a wide range of datasets and find that CosPlace is robust to heavy domain changes. Moreover, we show that, compared to the previous state-of-the-art, CosPlace requires roughly 80% less GPU memory at train time and achieves better results with 8x smaller descriptors, paving the way for city-wide real-world visual geo-localization. Dataset, code and trained models are available for research purposes at https://github.com/gmberton/CosPlace.
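The retrieval formulation in the first sentence of the abstract can be illustrated with a minimal sketch. This is not the CosPlace code: the function names `cosine_similarity` and `localize`, the three-dimensional descriptors, and the coordinates are all made up for illustration, and a real system would extract descriptors with a trained network and search the database with an efficient nearest-neighbor index.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def localize(query_descriptor, database):
    """Return the known position of the database image most similar to the query.

    `database` is a list of (descriptor, position) pairs; this layout is an
    assumption of the sketch, not a format defined by the paper.
    """
    _, best_position = max(
        database,
        key=lambda entry: cosine_similarity(query_descriptor, entry[0]),
    )
    return best_position


# Toy database: hypothetical descriptors paired with placeholder (lat, lon) values.
database = [
    ([1.0, 0.0, 0.0], (37.7749, -122.4194)),
    ([0.0, 1.0, 0.0], (37.8024, -122.4058)),
    ([0.0, 0.0, 1.0], (37.7694, -122.4862)),
]

query = [0.1, 0.9, 0.2]  # hypothetical descriptor of the query photo
estimated_position = localize(query, database)
print(estimated_position)
```

The query is assigned the position of its best match, so the cost of this exhaustive comparison grows with both database size and descriptor dimension, which is one reason the smaller descriptors mentioned in the abstract matter at city scale.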
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Place Recognition | MSLS (val) | Recall@1 | 87.4 | 236 |
| Visual Place Recognition | Pitts30k | Recall@1 | 90.9 | 164 |
| Visual Place Recognition | Tokyo24/7 | Recall@1 | 89.5 | 146 |
| Visual Place Recognition | MSLS Challenge | Recall@1 | 67.5 | 134 |
| Visual Place Recognition | Nordland | Recall@1 | 58.5 | 112 |
| Visual Place Recognition | SPED | Recall@1 | 80.1 | 106 |
| Visual Place Recognition | Pittsburgh30k (test) | Recall@1 | 88.4 | 86 |
| Visual Place Recognition | Pitts250k | Recall@1 | 92.3 | 84 |
| Visual Place Recognition | AmsterTime | Recall@1 | 47.7 | 83 |
| Visual Place Recognition | St Lucia | Recall@1 | 99.6 | 76 |
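
The Recall@1 figures in the table can be read with the help of a short sketch of how Recall@N is commonly computed in place recognition. It assumes that a query counts as correctly localized when at least one of its top N retrieved database images is a true match; benchmarks usually define a true match by a distance threshold around the query position, and the sets of positives are taken as given here. The function name `recall_at_n` and the toy indices are hypothetical and do not reproduce any result in the table.

```python
def recall_at_n(ranked_predictions, positives, n=1):
    """Percentage of queries with at least one true match among the top n results.

    ranked_predictions: one list per query of database indices, best match first.
    positives: one set per query with the database indices that are true matches.
    """
    hits = 0
    for ranked, relevant in zip(ranked_predictions, positives):
        # The query is a hit if any of its first n predictions is a true match.
        if any(index in relevant for index in ranked[:n]):
            hits += 1
    return 100.0 * hits / len(ranked_predictions)


# Toy example with four queries (all indices are made up).
ranked_predictions = [[4, 2, 7], [1, 0, 3], [5, 6, 2], [3, 8, 9]]
positives = [{4, 5}, {0}, {2}, {3}]

recall_1 = recall_at_n(ranked_predictions, positives, n=1)
recall_3 = recall_at_n(ranked_predictions, positives, n=3)
print(recall_1, recall_3)
```

In this toy case only the first and last queries have a true match at rank 1, while all four have one within the top 3, so Recall@3 is higher than Recall@1.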