GeoAlignCLIP: Enhancing Fine-Grained Vision-Language Alignment in Remote Sensing via Multi-Granular Consistency Learning
About
Vision-language pretraining models have made significant progress in bridging remote sensing imagery with natural language. However, existing approaches rely primarily on global image-text alignment and often fail to integrate multi-granular visual and textual information. This limits the model's ability to capture fine-grained details in images and restricts its performance on complex, fine-grained tasks. To address this, we propose GeoAlignCLIP, a unified framework that achieves fine-grained alignment in remote sensing tasks by learning multi-granular semantic alignments and incorporating intra-modal consistency, enabling more precise visual-semantic alignment between image regions and text concepts. Additionally, we construct RSFG-100k, a fine-grained remote sensing dataset containing scene descriptions, region-level annotations, and challenging hard-negative samples, providing hierarchical supervision for model training. Extensive experiments on multiple public remote sensing benchmarks demonstrate that GeoAlignCLIP consistently outperforms existing RS-specific methods across diverse tasks, exhibiting more robust and accurate fine-grained vision-language alignment.
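The abstract does not spell out the training objective, so the following is only a minimal sketch of what combining a global image-text alignment term with a region-level term could look like, assuming a CLIP-style symmetric contrastive (InfoNCE) loss at each granularity. The function names, the `temperature` default, and the `region_weight` parameter are illustrative assumptions, not GeoAlignCLIP's actual formulation, and the intra-modal consistency term is not modeled here.

```python
import math


def _mean_diagonal_nll(logits):
    """Mean negative log-softmax of the diagonal (matched) entry of each row."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(sim, temperature=0.07):
    """CLIP-style symmetric InfoNCE over a square similarity matrix.

    sim[i][j] is the similarity between visual item i and text item j;
    matched pairs are assumed to sit on the diagonal.
    """
    n = len(sim)
    logits = [[v / temperature for v in row] for row in sim]
    transposed = [[logits[i][j] for i in range(n)] for j in range(n)]
    # Average the visual-to-text and text-to-visual directions.
    return 0.5 * (_mean_diagonal_nll(logits) + _mean_diagonal_nll(transposed))


def multi_granular_loss(global_sim, region_sim, region_weight=1.0):
    """Hypothetical combination: image-caption term plus region-phrase term."""
    return (symmetric_contrastive_loss(global_sim)
            + region_weight * symmetric_contrastive_loss(region_sim))
```

In this sketch, `global_sim` would hold whole-image versus scene-description similarities and `region_sim` would hold region versus phrase similarities; hard negatives would appear as extra off-diagonal candidates that the loss pushes away from the matched pair.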
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Detection | DOTA 1.0 (test) | -- | 256 |
| Image-to-Text Retrieval | RSITMD (test) | R@1 21.02 | 77 |
| Text-to-Image Retrieval | RSITMD (test) | R@1 17.39 | 77 |
| Text-to-Image Retrieval | RSICD (test) | R@1 11.4 | 50 |
| Image-to-Text Retrieval | RSICD (test) | R@1 14.82 | 29 |
| Object Detection | DIOR (test) | -- | 24 |
| Fine-grained Understanding | RRSIS-HR | Acc@1 33.45 | 21 |
| Fine-grained Understanding | CHOICE img | Accuracy 92 | 21 |
| Region-level Classification | NWPU VHR-10 | Top-1 Accuracy 93.75 | 21 |
| Region-level Classification | RRSIS-D | Top-1 Accuracy 82.89 | 21 |
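The retrieval rows report R@1, i.e. Recall@K with K = 1: the percentage of queries for which at least one correct item appears among the top K ranked candidates. The sketch below illustrates that standard definition under the assumption that a query-by-candidate similarity matrix is available; `recall_at_k` and its argument names are hypothetical and are not taken from any GeoAlignCLIP evaluation code.

```python
def recall_at_k(sim, ground_truth, k=1):
    """Recall@K as a percentage.

    sim[q][c] is the similarity of query q to candidate c.
    ground_truth[q] is the set of candidate indices that are correct for q
    (an image may have several matching captions, hence a set).
    """
    hits = 0
    for q, row in enumerate(sim):
        # Rank candidate indices by descending similarity.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if ground_truth[q] & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(sim)
```

For image-to-text retrieval the queries are images and the candidates are captions; for text-to-image retrieval the roles are swapped, which corresponds to transposing the similarity matrix and the ground-truth mapping.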