SkyLink: A Large Vision-Language Model Driven Re-ranking Framework for Cross-View UAV Geolocalization
About
Cross-view UAV geolocalization is a challenging large-scale image retrieval task: it aims to determine the geographic coordinates of Unmanned Aerial Vehicle (UAV) queries by matching them against an extensive geo-tagged satellite image database. Most existing methods learn separate feature representations for each view and produce the final prediction with naive feature-similarity heuristics, so they do not model the crucial relationships between views. In this paper, we propose SkyLink, a novel plug-and-play re-ranking framework that jointly models inter-view relationships to enhance cross-view UAV geolocalization. SkyLink leverages a Large Vision-Language Model (LVLM) to model the intricate visual-semantic relationships between UAV and satellite views, facilitating effective cross-view matching. To further refine the learning process, we introduce a relational-aware loss that uses soft labels to provide a more nuanced supervision signal, mitigating the harsh penalty on near-positive pairs. This enhances both training stability and the model's discriminative capacity. Extensive experiments across multiple base retrieval architectures and benchmark datasets demonstrate that SkyLink significantly boosts the ranking effectiveness of existing models and consistently achieves superior performance in various challenging scenarios.
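The two ideas in the abstract, re-ranking the candidates of a base retriever with a joint pairwise scorer and supervising with soft labels, can be illustrated with a minimal sketch. This is not the authors' implementation: `rerank_top_k`, `soft_label_cross_entropy`, the `pair_scorer` callback (a stand-in for an LVLM that scores a UAV/satellite pair jointly), and the cutoff `k` are all hypothetical names chosen for illustration, and the loss shown is a generic soft-target cross-entropy rather than the paper's relational-aware loss.

```python
import math
from typing import Any, Callable, List, Sequence


def rerank_top_k(
    query: Any,
    candidates: Sequence[Any],
    base_scores: Sequence[float],
    pair_scorer: Callable[[Any, Any], float],
    k: int = 10,
) -> List[int]:
    """Return candidate indices, with the base retriever's top-k re-ordered.

    `pair_scorer` is a hypothetical stand-in for a model that looks at the
    query and one candidate together and returns a relevance score.
    """
    # Initial ranking from the base retriever (higher score = better match).
    order = sorted(range(len(candidates)), key=lambda i: base_scores[i], reverse=True)
    head, tail = order[:k], order[k:]
    # Only the top-k are re-scored jointly; the rest keep their base order.
    head = sorted(head, key=lambda i: pair_scorer(query, candidates[i]), reverse=True)
    return head + tail


def soft_label_cross_entropy(logits: Sequence[float], soft_targets: Sequence[float]) -> float:
    """Cross-entropy against a soft target distribution (assumed to sum to 1).

    Giving near-positive candidates a small non-zero target penalizes them
    less harshly than a one-hot label would.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))  # stable log-sum-exp
    return -sum(t * (x - log_z) for x, t in zip(logits, soft_targets))
```

For example, with base scores `[0.9, 0.8, 0.7, 0.1]` and `k=3`, only the first three candidates are passed to `pair_scorer`; the fourth keeps its position regardless of how it would have been scored. Restricting the expensive scorer to a short list is the usual reason a re-ranker can be attached to different base retrievers.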
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Drone-to-Satellite Cross-view Geo-localization | SUES-200 150m | R@1 | 95.77 | 74 |
| Satellite→Drone Geo-localization | SUES-200 300m | R@1 | 97.62 | 40 |
| Satellite→Drone Geo-localization | SUES-200 250m | R@1 | 97.75 | 36 |
| Satellite→Drone Geo-localization | SUES-200 200m | R@1 | 96.45 | 36 |
| Cross-view geo-localization | University-1652 D2S | R@1 | 95.28 | 6 |
| Cross-view geo-localization | University-1652 S2D | R@1 | 94.58 | 6 |
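The benchmark entries above are all reported as Recall@1. As a reminder of what that metric measures, here is a minimal sketch; it assumes the values are percentages and that a query counts as a hit when any of its ground-truth matches appears in the top K. The function name `recall_at_k` and the toy identifiers are illustrative, not taken from any benchmark toolkit.

```python
from typing import List, Sequence, Set


def recall_at_k(ranked_ids: Sequence[Sequence[str]], positive_ids: Sequence[Set[str]], k: int = 1) -> float:
    """Percentage of queries with at least one true match in the top-k results.

    ranked_ids[i]   : gallery ids for query i, best match first
    positive_ids[i] : set of gallery ids that are correct for query i
    """
    hits = 0
    for ranking, positives in zip(ranked_ids, positive_ids):
        if any(candidate in positives for candidate in ranking[:k]):
            hits += 1
    return 100.0 * hits / len(ranked_ids)


# Toy data: four queries, two gallery results each.
rankings: List[List[str]] = [["s1", "s2"], ["s3", "s4"], ["s9", "s5"], ["s7", "s6"]]
positives: List[Set[str]] = [{"s1"}, {"s4"}, {"s5"}, {"s7"}]
r1 = recall_at_k(rankings, positives, k=1)
r2 = recall_at_k(rankings, positives, k=2)
```

Using a set of positives per query covers the satellite-to-drone direction, where one satellite image can have several correct drone views.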