Optimal Transport Aggregation for Visual Place Recognition
About
The task of Visual Place Recognition (VPR) aims to match a query image against references from an extensive database of images from different places, relying solely on visual cues. State-of-the-art pipelines focus on the aggregation of features extracted from a deep backbone, in order to form a global descriptor for each image. In this context, we introduce SALAD (Sinkhorn Algorithm for Locally Aggregated Descriptors), which reformulates NetVLAD's soft-assignment of local features to clusters as an optimal transport problem. In SALAD, we consider both feature-to-cluster and cluster-to-feature relations and we also introduce a 'dustbin' cluster, designed to selectively discard features deemed non-informative, enhancing the overall descriptor quality. Additionally, we leverage and fine-tune DINOv2 as a backbone, which provides enhanced description power for the local features, and dramatically reduces the required training time. As a result, our single-stage method not only surpasses single-stage baselines in public VPR datasets, but also surpasses two-stage methods that add a re-ranking with significantly higher cost. Code and models are available at https://github.com/serizba/salad.
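The assignment step described above can be sketched as a small entropic optimal transport problem solved with Sinkhorn iterations in the log domain. This is a minimal illustration, not the authors' implementation: the function names, the constant `dustbin_score`, the marginals (one unit of mass per feature and per cluster, with the remaining mass going to the dustbin), and the plain weighted-sum aggregation are all assumptions made for this sketch. In the actual method the scores would come from a learned layer on top of DINOv2 features; here they are hard-coded.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def sinkhorn_with_dustbin(scores, dustbin_score=1.0, n_iters=100):
    """Soft-assign n features to m clusters plus one dustbin column.

    scores: n x m list of feature-to-cluster scores (hypothetical input).
    Returns (assignment, dustbin): the n x m transport plan restricted to
    the real clusters, and the mass each feature sends to the dustbin.
    Assumed marginals: every feature carries mass 1, every cluster
    receives mass 1, and the dustbin receives the remaining n - m.
    """
    n, m = len(scores), len(scores[0])
    if n <= m:
        raise ValueError("this sketch assumes more features than clusters")
    # Append the dustbin column to every row of the score matrix.
    aug = [list(row) + [dustbin_score] for row in scores]
    log_mu = [0.0] * n                        # log(1) per feature
    log_nu = [0.0] * m + [math.log(n - m)]    # clusters, then dustbin
    u = [0.0] * n
    v = [0.0] * (m + 1)
    for _ in range(n_iters):
        # Row update enforces the feature-to-cluster constraint.
        u = [log_mu[i] - logsumexp([aug[i][j] + v[j] for j in range(m + 1)])
             for i in range(n)]
        # Column update enforces the cluster-to-feature constraint.
        v = [log_nu[j] - logsumexp([aug[i][j] + u[i] for i in range(n)])
             for j in range(m + 1)]
    plan = [[math.exp(aug[i][j] + u[i] + v[j]) for j in range(m + 1)]
            for i in range(n)]
    # Dropping the last column discards the mass sent to the dustbin.
    return [row[:m] for row in plan], [row[m] for row in plan]


def aggregate(features, assignment):
    """Weighted sum of local features per cluster (assumed aggregation)."""
    n, dim, m = len(features), len(features[0]), len(assignment[0])
    return [[sum(assignment[i][j] * features[i][d] for i in range(n))
             for d in range(dim)]
            for j in range(m)]


if __name__ == "__main__":
    # Four toy local features, two clusters; all values are made up.
    toy_scores = [[1.5, 0.1], [0.2, 1.2], [0.0, 0.1], [0.9, 0.8]]
    toy_features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]
    assignment, dustbin = sinkhorn_with_dustbin(toy_scores)
    descriptor = aggregate(toy_features, assignment)
    print(assignment, dustbin, descriptor)
```

Alternating the row and column updates is what lets the plan respect both directions of the feature and cluster relation, while the extra column gives low-scoring features somewhere to go other than a real cluster. A global descriptor could then be formed by flattening and normalizing the per-cluster vectors, which this sketch leaves out.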
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Visual Place Recognition | MSLS (val) | Recall@1 | 94.2 | 236 |
| Visual Place Recognition | Pitts30k | Recall@1 | 92.6 | 164 |
| Visual Place Recognition | Tokyo24/7 | Recall@1 | 96.8 | 146 |
| Visual Place Recognition | MSLS Challenge | Recall@1 | 82.7 | 134 |
| Visual Place Recognition | Nordland | Recall@1 | 89.7 | 112 |
| Visual Place Recognition | SPED | Recall@1 | 92.1 | 106 |
| Visual Place Recognition | Pittsburgh30k (test) | Recall@1 | 92.5 | 86 |
| Visual Place Recognition | Pitts250k | Recall@1 | 95.2 | 84 |
| Visual Place Recognition | AmsterTime | Recall@1 | 58.8 | 83 |
| Visual Place Recognition | St Lucia | R@1 | 100 | 76 |
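
The Recall@1 figures in the table follow the usual retrieval convention: a query counts as correct if at least one of its top-K retrieved references is a true match for the query's place. The sketch below shows that computation under assumed inputs; `recall_at_k`, the ranked ID lists, and the ground-truth sets are hypothetical names and toy data, and each benchmark defines its own criterion for what counts as a true match.

```python
def recall_at_k(ranked_ids, positives, k=1):
    """Percentage of queries with a true match among the top-k results.

    ranked_ids: per query, database IDs sorted from best to worst match.
    positives:  per query, the set of database IDs that are true matches.
    """
    if not ranked_ids:
        raise ValueError("need at least one query")
    hits = 0
    for retrieved, truth in zip(ranked_ids, positives):
        # One correct reference within the first k results is enough.
        if any(ref in truth for ref in retrieved[:k]):
            hits += 1
    return 100.0 * hits / len(ranked_ids)


if __name__ == "__main__":
    # Toy retrieval results for three queries (made-up IDs).
    ranked = [[3, 1], [2, 0], [5, 4]]
    truth = [{3}, {0}, {9}]
    print(recall_at_k(ranked, truth, k=1))
    print(recall_at_k(ranked, truth, k=2))
```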