UNIGEOCLIP: Unified Geospatial Contrastive Learning

About

The growing availability of co-located geospatial data spanning aerial imagery, street-level views, elevation models, text, and geographic coordinates offers a unique opportunity for multimodal representation learning. We introduce UNIGEOCLIP, a massively multimodal contrastive framework to jointly align five complementary geospatial modalities in a single unified embedding space. Unlike prior approaches that fuse modalities or rely on a central pivot representation, our method performs all-to-all contrastive alignment, enabling seamless comparison, retrieval, and reasoning across arbitrary combinations of modalities. We further propose a scaled latitude-longitude encoder that improves spatial representation by capturing multi-scale geographic structure. Extensive experiments across downstream geospatial tasks demonstrate that UNIGEOCLIP consistently outperforms single-modality contrastive models and coordinate-only baselines, highlighting the benefits of holistic multimodal geospatial alignment. A reference implementation is available at https://gastruc.github.io/unigeoclip.

Guillaume Astruc, Eduard Trulls, Jan Hosang, Loic Landrieu, Paul-Edouard Sarlin • 2026
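
The all-to-all alignment described in the abstract can be illustrated with a small sketch. The abstract does not specify the loss, so everything here is an assumption made for illustration: a symmetric InfoNCE term for every pair of modalities, equal weighting of the pairs, a temperature of 0.07, and the hypothetical function names `info_nce` and `all_to_all_loss`. With five modalities this gives 10 unordered pairs, in contrast to a pivot design that would only align each modality with one central representation.

```python
import math
from itertools import combinations


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(anchors, targets, temperature=0.07):
    """Mean cross-entropy of picking targets[i] for anchors[i] among all targets."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [
            sum(a * t for a, t in zip(anchor, target)) / temperature
            for target in targets
        ]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)


def all_to_all_loss(embeddings, temperature=0.07):
    """Average symmetric InfoNCE over every unordered pair of modalities.

    embeddings maps a modality name to a list of vectors; row i of every
    modality is assumed to describe the same location (hypothetical layout).
    """
    normed = {
        name: [l2_normalize(v) for v in rows] for name, rows in embeddings.items()
    }
    pair_losses = []
    for a, b in combinations(sorted(normed), 2):
        forward = info_nce(normed[a], normed[b], temperature)
        backward = info_nce(normed[b], normed[a], temperature)
        pair_losses.append(0.5 * (forward + backward))
    return sum(pair_losses) / len(pair_losses)


# Toy batch: three locations, five modalities, perfectly aligned embeddings.
MODALITIES = ["aerial", "street", "elevation", "text", "coords"]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = {name: [row[:] for row in identity] for name in MODALITIES}

# Same batch, but the street-level rows are rotated so they no longer match.
misaligned = dict(aligned)
misaligned["street"] = identity[1:] + identity[:1]

aligned_loss = all_to_all_loss(aligned)
misaligned_loss = all_to_all_loss(misaligned)
```

In this toy setup the aligned batch yields a lower loss than the batch with shuffled street-level rows, which is the behavior a contrastive objective is meant to reward. A real model would replace the hand-written vectors with encoder outputs and would typically use a tensor library rather than pure Python.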

Related benchmarks

Task	Dataset	Metric	Result
Cross-view geo-localization	CVUSA (test)	--	--	36
Segmentation	m-chesapeake	Mean mIoU	66.3	23
Image-to-Location Retrieval	Geospatial Street-View (USA) (test)	Accuracy @ 100m	69.4	11
Classification	m-pv ger 4	Overall Accuracy (OA)	97	10
Regression	27 downstream regression tasks (test)	Health Score	53.1	8
Out-of-Distribution Cross-View Retrieval	Geospatial Street-View Amsterdam (test)	Accuracy @ 100m	41.3	4
Multimodal Ensemble Retrieval	Geospatial Street-View (USA) (test)	Acc@100m	91	3
Semantic segmentation	MDAS	Accuracy	72	3
