
UNIGEOCLIP: Unified Geospatial Contrastive Learning

About

The growing availability of co-located geospatial data, spanning aerial imagery, street-level views, elevation models, text, and geographic coordinates, offers a unique opportunity for multimodal representation learning. We introduce UNIGEOCLIP, a massively multimodal contrastive framework that jointly aligns five complementary geospatial modalities in a single unified embedding space. Unlike prior approaches that fuse modalities or rely on a central pivot representation, our method performs all-to-all contrastive alignment, enabling seamless comparison, retrieval, and reasoning across arbitrary combinations of modalities. We further propose a scaled latitude-longitude encoder that improves spatial representation by capturing multi-scale geographic structure. Extensive experiments across downstream geospatial tasks demonstrate that UNIGEOCLIP consistently outperforms single-modality contrastive models and coordinate-only baselines, highlighting the benefits of holistic multimodal geospatial alignment. A reference implementation is available at https://gastruc.github.io/unigeoclip.
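To make "all-to-all contrastive alignment" concrete, the sketch below averages a symmetric InfoNCE loss over every unordered pair of modalities, so no single modality acts as a pivot. It is a minimal illustration and not the authors' implementation: the modality names, the function names (`info_nce`, `all_to_all_loss`), the temperature of 0.07, and the equal weighting of pairs are all assumptions made here.

```python
import math
from itertools import combinations


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(a, b, temperature=0.07):
    """One-directional InfoNCE: row i of `a` should match row i of `b`.

    `temperature` is an assumed value, not one taken from the paper.
    """
    loss = 0.0
    for i, a_i in enumerate(a):
        logits = [sum(x * y for x, y in zip(a_i, b_j)) / temperature for b_j in b]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_z - logits[i]
    return loss / len(a)


def all_to_all_loss(embeddings, temperature=0.07):
    """Average symmetric InfoNCE over every unordered pair of modalities.

    `embeddings` maps a modality name to a batch of vectors; index i in
    every batch is assumed to describe the same location.
    """
    normed = {m: [l2_normalize(v) for v in batch] for m, batch in embeddings.items()}
    pairs = list(combinations(sorted(normed), 2))
    total = 0.0
    for m1, m2 in pairs:
        total += 0.5 * (
            info_nce(normed[m1], normed[m2], temperature)
            + info_nce(normed[m2], normed[m1], temperature)
        )
    return total / len(pairs)


# Hypothetical modality names standing in for the five modalities.
MODALITIES = ["aerial", "street", "elevation", "text", "coords"]
aligned = {m: [[1.0, 0.0], [0.0, 1.0]] for m in MODALITIES}
misaligned = dict(aligned, text=[[0.0, 1.0], [1.0, 0.0]])  # text batch swapped
```

With five modalities there are ten pairs to align, compared with four in a pivot-based scheme. In this toy setup `all_to_all_loss(aligned)` is close to zero, while swapping one modality's batch raises the loss through the four pairs that involve it.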

Guillaume Astruc, Eduard Trulls, Jan Hosang, Loic Landrieu, Paul-Edouard Sarlin • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Cross-view geo-localization | CVUSA (test) | -- | 36 |
| Segmentation | m-chesapeake | Mean mIoU: 66.3 | 23 |
| Image-to-Location Retrieval | Geospatial Street-View (USA) (test) | Accuracy @ 100m: 69.4 | 11 |
| Classification | m-pv ger 4 | Overall Accuracy (OA): 97 | 10 |
| Regression | 27 downstream regression tasks (test) | Health Score: 53.1 | 8 |
| Out-of-Distribution Cross-View Retrieval | Geospatial Street-View Amsterdam (test) | Accuracy @ 100m: 41.3 | 4 |
| Multimodal Ensemble Retrieval | Geospatial Street-View (USA) (test) | Acc@100m: 91 | 3 |
| Semantic segmentation | MDAS | Accuracy: 72 | 3 |
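Several of the results above are reported as accuracy at 100 m. The listing does not define the metric, so the sketch below assumes the usual reading: the fraction of queries whose predicted location lies within 100 m great-circle distance of the ground truth. The function names and the mean Earth radius constant are choices made for this illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres (approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def accuracy_at(predicted, truth, threshold_m=100.0):
    """Fraction of predictions within `threshold_m` metres of the ground truth.

    Assumed definition: one predicted (lat, lon) per query, compared with
    the query's true (lat, lon).
    """
    hits = sum(
        haversine_m(*p, *t) <= threshold_m for p, t in zip(predicted, truth)
    )
    return hits / len(truth)


# Two queries: the first prediction is exact, the second is kilometres off.
predicted = [(48.8584, 2.2945), (48.8584, 2.2945)]
truth = [(48.8584, 2.2945), (48.8606, 2.3376)]
score = accuracy_at(predicted, truth)
```

Multiplying the returned fraction by 100 gives a percentage, which is presumably how the figures in the table are expressed.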
