GAIR: Location-Aware Self-Supervised Contrastive Pre-Training with Geo-Aligned Implicit Representations

About

Vision Transformers (ViTs) have been widely used in computer vision tasks with excellent results, providing representations for a whole image or for image patches. However, ViT lacks detailed localized image representations at arbitrary positions when applied to geospatial tasks that involve multiple geospatial data modalities, such as overhead remote sensing (RS) data, ground-level imagery, and geospatial vector data. In these tasks, high-resolution localized representations are vital for modeling geospatial relationships and alignments across modalities. We propose to solve this representation problem with an implicit neural representation (INR) module that extends ViT with Neural Implicit Local Interpolation, producing a continuous RS image representation that covers arbitrary locations in the RS image. Based on the INR module, we introduce GAIR, a novel location-aware self-supervised learning (SSL) objective integrating overhead RS data, street view (SV) imagery, and their geolocation metadata. GAIR uses three factorized neural encoders to project the different modalities into the embedding space, and the INR module further aligns these representations geographically; the model is trained with contrastive learning objectives on unlabeled data. We evaluate GAIR across 9 geospatial tasks and 22 datasets spanning RS image-based, SV image-based, and location embedding-based benchmarks. Experimental results demonstrate that GAIR outperforms state-of-the-art geo-foundation models (GeoFM) and alternative SSL training objectives (e.g., MoCo V3 and MAE) that do not use fine-grained geo-aligned spatial representations. Our results highlight the effectiveness of GAIR in learning generalizable geospatial representations across tasks, spatial scales, and temporal contexts. The project code is available at https://github.com/zpl99/GAIR.
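To make the two ideas in the abstract concrete (a representation that can be queried at arbitrary image locations, and contrastive alignment between modalities), here is a minimal sketch. It is not the GAIR implementation: plain bilinear interpolation stands in for Neural Implicit Local Interpolation, a generic InfoNCE loss stands in for the paper's objectives, and the function names (`sample_feature`, `info_nce`), the toy grid, and the temperature value are all hypothetical.

```python
import math

def sample_feature(grid, x, y):
    """Bilinearly interpolate grid[row][col] feature vectors at continuous (x, y).

    Assumption: x and y are given in grid units (x along columns, y along rows).
    This is a simple stand-in for a learned implicit representation.
    """
    h, w = len(grid), len(grid[0])
    x = min(max(x, 0.0), w - 1.0)  # clamp the query to the grid extent
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    out = []
    for d in range(len(grid[0][0])):
        top = (1 - wx) * grid[y0][x0][d] + wx * grid[y0][x1][d]
        bottom = (1 - wx) * grid[y1][x0][d] + wx * grid[y1][x1][d]
        out.append((1 - wy) * top + wy * bottom)
    return out

def _unit(v):
    """L2-normalize a vector (assumes it is not all zeros)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def info_nce(anchors, positives, temperature=0.07):
    """Generic InfoNCE loss: anchors[i] should match positives[i], not positives[j]."""
    a = [_unit(v) for v in anchors]
    p = [_unit(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(s * t for s, t in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)

# Toy 2x2 grid of 2-D patch features, a stand-in for ViT patch embeddings.
grid = [[[1.0, 0.0], [0.0, 1.0]],
        [[-1.0, 0.0], [0.0, -1.0]]]
# Hypothetical street-view positions inside the RS image, in grid units.
locations = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
rs_local = [sample_feature(grid, x, y) for x, y in locations]
# Fake street-view embeddings that roughly agree with the sampled RS features.
sv_embeddings = [[c + 0.01 for c in v] for v in rs_local]
loss = info_nce(rs_local, sv_embeddings)
```

The loss is small when each sampled RS feature is closest to the street-view embedding taken at the same location, and grows when the pairs are mismatched, which is the behavior a geo-aligned contrastive objective relies on.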

Zeping Liu, Ni Lao, Zhangyu Wang, Junfeng Jiao, Gengchen Mai • 2025

Related benchmarks

Task	Dataset	Result
Classification	Land Cover	F1 Score 67.3	76
Classification	Land Use Coarse	F1 Score 61.7	70
Classification	Land Use Fine	F1 Score 55.2	70
Regression	Urban Perception avg. 6 tasks	R2 Score 17.4	58
Regression	Crime Incidence	R-squared (%) 87.5	48
Urban Perception	Place Pulse 2.0	Cleanliness 6.7	44
Semantic Segmentation (Cropland)	AI4SmallFarms	mIoU 43.47	42
Semantic Segmentation (Burn Scars)	AI4SmallFarms	mIoU 87	42
Regression	ZIP Code weighted avg. 29 tasks (cross-regional)	R^2 62.5	40
Human Perception Regression	Street View Imagery	RMSE 1.5072	39

Showing 10 of 46 rows
