
GeoAlignCLIP: Enhancing Fine-Grained Vision-Language Alignment in Remote Sensing via Multi-Granular Consistency Learning

About

Vision-language pretraining models have made significant progress in bridging remote sensing imagery with natural language. However, existing approaches often fail to effectively integrate multi-granular visual and textual information, relying primarily on global image-text alignment. This limitation hinders the model's ability to accurately capture fine-grained details in images, thus restricting its performance in complex, fine-grained tasks. To address this, we propose GeoAlignCLIP, a unified framework that achieves fine-grained alignment in remote sensing tasks by learning multi-granular semantic alignments and incorporating intra-modal consistency, enabling more precise visual-semantic alignment between image regions and text concepts. Additionally, we construct RSFG-100k, a fine-granular remote sensing dataset containing scene descriptions, region-level annotations, and challenging hard-negative samples, providing hierarchical supervision for model training. Extensive experiments conducted on multiple public remote-sensing benchmarks demonstrate that GeoAlignCLIP consistently outperforms existing RS-specific methods across diverse tasks, exhibiting more robust and accurate fine-grained vision-language alignment.
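To make the idea of aligning at more than one granularity concrete, the following is a minimal sketch, not the authors' implementation. It assumes a standard CLIP-style symmetric contrastive (InfoNCE) loss applied once to whole-image/caption pairs and once to region/phrase pairs, then summed. The function names, the `region_weight` parameter, and the temperature value are hypothetical choices for illustration; the abstract does not specify the actual loss, and the intra-modal consistency term it mentions is omitted here.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; row i of each list is a matched pair."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # Cosine-similarity logits, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class for row k is column k.
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[k]
        return total / len(rows)

    text_to_image = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(text_to_image))


def multi_granular_loss(global_pair, region_pair, region_weight=0.5):
    """Hypothetical combination: global loss plus a weighted region-level loss."""
    global_loss = info_nce(*global_pair)
    region_loss = info_nce(*region_pair)
    return global_loss + region_weight * region_loss
```

With toy embeddings, correctly matched pairs give a loss near zero, while swapping the captions drives it up sharply. This is the signal that hard-negative samples, such as those the abstract describes for RSFG-100k, are meant to strengthen.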

Xiao Yang, Ronghao Fu, Zhuoran Duan, Zhiwen Lin, Xueyan Liu, Bo Yang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Detection | DOTA 1.0 (test) | -- | 256 |
| Image-to-Text Retrieval | RSITMD (test) | R@1: 21.02 | 77 |
| Text-to-Image Retrieval | RSITMD (test) | R@1: 17.39 | 77 |
| Text-to-Image Retrieval | RSICD (test) | R@1: 11.4 | 50 |
| Image-to-Text Retrieval | RSICD (test) | R@1: 14.82 | 29 |
| Object Detection | DIOR (test) | -- | 24 |
| Fine-grained Understanding | RRSIS-HR | Acc@1: 33.45 | 21 |
| Fine-grained Understanding | CHOICE img | Accuracy: 92 | 21 |
| Region-level Classification | NWPU VHR-10 | Top-1 Accuracy: 93.75 | 21 |
| Region-level Classification | RRSIS-D | Top-1 Accuracy: 82.89 | 21 |
Showing 10 of 13 rows
