
GAIR: Location-Aware Self-Supervised Contrastive Pre-Training with Geo-Aligned Implicit Representations

About

Vision Transformer (ViT) has been widely used in computer vision tasks with excellent results by providing representations for a whole image or image patches. However, ViT lacks detailed localized image representations at arbitrary positions when applied to geospatial tasks that involve multiple geospatial data modalities, such as overhead remote sensing (RS) data, ground-level imagery, and geospatial vector data. Here, high-resolution localized representations are vital for modeling geospatial relationships and alignments across modalities. We propose to solve this representation problem with an implicit neural representation (INR) module that extends ViT with Neural Implicit Local Interpolation, producing a continuous RS image representation that covers arbitrary locations in the RS image. Based on the INR module, we introduce GAIR, a novel location-aware self-supervised learning (SSL) objective integrating overhead RS data, street view (SV) imagery, and their geolocation metadata. GAIR utilizes three factorized neural encoders to project different modalities into the embedding space, and the INR module is used to further align these representations geographically; the encoders and the INR module are trained with contrastive learning objectives on unlabeled data. We evaluate GAIR across 9 geospatial tasks and 22 datasets spanning RS image-based, SV image-based, and location embedding-based benchmarks. Experimental results demonstrate that GAIR outperforms state-of-the-art geo-foundation models (GeoFM) and alternative SSL training objectives (e.g., MoCo V3 and MAE) that do not use fine-grained geo-aligned spatial representations. Our results highlight the effectiveness of GAIR in learning generalizable geospatial representations across tasks, spatial scales, and temporal contexts. The project code is available at https://github.com/zpl99/GAIR.

Zeping Liu, Ni Lao, Zhangyu Wang, Junfeng Jiao, Gengchen Mai • 2025
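
The two ideas in the abstract, a localized RS representation at an arbitrary position and a contrastive objective that aligns it with a street view embedding, can be illustrated with a small sketch. This is not the authors' implementation: plain bilinear interpolation stands in for the learned Neural Implicit Local Interpolation module, and the function names (`interpolate_patch_grid`, `info_nce`), the temperature value, and the toy data are all assumptions made for illustration.

```python
import math


def interpolate_patch_grid(grid, x, y):
    """Bilinearly sample an H x W grid of patch embeddings at a continuous point.

    Assumption: bilinear weights replace the learned interpolation described
    in the abstract. Coordinates are in patch units and are clamped to the grid.
    """
    h, w = len(grid), len(grid[0])
    x = min(max(x, 0.0), w - 1)
    y = min(max(y, 0.0), h - 1)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    weighted = [
        ((1 - fx) * (1 - fy), grid[y0][x0]),
        (fx * (1 - fy), grid[y0][x1]),
        ((1 - fx) * fy, grid[y1][x0]),
        (fx * fy, grid[y1][x1]),
    ]
    dim = len(grid[0][0])
    return [sum(wt * vec[d] for wt, vec in weighted) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b + 1e-12)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: positives[i] matches anchors[i], the rest are negatives."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


# Toy 2 x 2 grid of 2-dimensional patch embeddings (hypothetical values).
toy_grid = [[[1.0, 0.0], [0.0, 1.0]],
            [[0.0, 1.0], [1.0, 0.0]]]
rs_at_location = interpolate_patch_grid(toy_grid, 0.5, 0.5)

# Hypothetical paired embeddings: RS features sampled at each street view
# location, and the street view embeddings themselves.
rs_embeddings = [[1.0, 0.0], [0.0, 1.0]]
sv_embeddings = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(rs_embeddings, sv_embeddings)
shuffled_loss = info_nce(rs_embeddings, list(reversed(sv_embeddings)))
```

In this toy setup the loss is lower when each RS embedding is paired with the street view embedding from the same location than when the pairs are shuffled, which is the behavior a geo-aligned contrastive objective relies on.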

Related benchmarks

Task                                Dataset                                           Metric          Result   Rank
Classification                      Land Cover                                        F1 Score        67.3     76
Classification                      Land Use Coarse                                   F1 Score        61.7     70
Classification                      Land Use Fine                                     F1 Score        55.2     70
Regression                          Urban Perception avg. 6 tasks                     R2 Score        17.4     58
Regression                          Crime Incidence                                   R-squared (%)   87.5     48
Urban Perception                    Place Pulse 2.0                                   Cleanliness     6.7      44
Semantic Segmentation (Cropland)    AI4SmallFarms                                     mIoU            43.47    42
Semantic Segmentation (Burn Scars)  AI4SmallFarms                                     mIoU            87       42
Regression                          ZIP Code weighted avg. 29 tasks (cross-regional)  R^2             62.5     40
Human Perception Regression         Street View Imagery                               RMSE            1.5072   39
Showing 10 of 46 rows
