UNIGEOCLIP: Unified Geospatial Contrastive Learning
About
The growing availability of co-located geospatial data spanning aerial imagery, street-level views, elevation models, text, and geographic coordinates offers a unique opportunity for multimodal representation learning. We introduce UNIGEOCLIP, a massively multimodal contrastive framework to jointly align five complementary geospatial modalities in a single unified embedding space. Unlike prior approaches that fuse modalities or rely on a central pivot representation, our method performs all-to-all contrastive alignment, enabling seamless comparison, retrieval, and reasoning across arbitrary combinations of modalities. We further propose a scaled latitude-longitude encoder that improves spatial representation by capturing multi-scale geographic structure. Extensive experiments across downstream geospatial tasks demonstrate that UNIGEOCLIP consistently outperforms single-modality contrastive models and coordinate-only baselines, highlighting the benefits of holistic multimodal geospatial alignment. A reference implementation is available at https://gastruc.github.io/unigeoclip.
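The all-to-all alignment described above can be illustrated with a minimal sketch. It assumes a symmetric InfoNCE loss with a fixed temperature, averaged over every unordered pair of modalities; the abstract does not specify the exact loss, weighting, or temperature, so the function names (`info_nce`, `all_to_all_loss`), the temperature value, and the modality keys below are hypothetical choices for illustration, not the authors' implementation.

```python
import itertools
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two (n, d) batches; row i of a matches row i of b.

    The temperature value is an assumption, not taken from the paper.
    """
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    diag = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[diag, diag].mean()  # a -> b retrieval
    loss_ba = -log_softmax(logits, axis=0)[diag, diag].mean()  # b -> a retrieval
    return 0.5 * (loss_ab + loss_ba)


def all_to_all_loss(embeddings, temperature=0.07):
    """Average the pairwise loss over every unordered pair of modalities.

    No modality acts as a pivot: with five modalities there are 10 pairs.
    """
    pairs = list(itertools.combinations(sorted(embeddings), 2))
    total = sum(info_nce(embeddings[m], embeddings[n], temperature) for m, n in pairs)
    return total / len(pairs)


# Toy data standing in for encoder outputs (hypothetical modality keys).
rng = np.random.default_rng(0)
modalities = ["aerial", "street_view", "elevation", "text", "coordinates"]
shared = rng.normal(size=(8, 16))  # 8 locations, 16-dimensional embeddings

# Embeddings that agree across modalities versus unrelated embeddings.
aligned = {m: shared + 0.01 * rng.normal(size=shared.shape) for m in modalities}
unrelated = {m: rng.normal(size=shared.shape) for m in modalities}

loss_aligned = all_to_all_loss(aligned)
loss_unrelated = all_to_all_loss(unrelated)
```

In this toy setup the aligned embeddings yield a lower loss than the unrelated ones, which is the behavior the objective is meant to reward. A pivot-based scheme would instead sum only the pairs that include one chosen modality.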
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Cross-view geo-localization | CVUSA (test) | -- | -- | 36 |
| Segmentation | m-chesapeake | Mean mIoU | 66.3 | 23 |
| Image-to-Location Retrieval | Geospatial Street-View (USA) (test) | Accuracy @ 100m | 69.4 | 11 |
| Classification | m-pv ger 4 | Overall Accuracy (OA) | 97 | 10 |
| Regression | 27 downstream regression tasks (test) | Health Score | 53.1 | 8 |
| Out-of-Distribution Cross-View Retrieval | Geospatial Street-View Amsterdam (test) | Accuracy @ 100m | 41.3 | 4 |
| Multimodal Ensemble Retrieval | Geospatial Street-View (USA) (test) | Acc@100m | 91 | 3 |
| Semantic segmentation | MDAS | Accuracy | 72 | 3 |