
Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing Image-Text Retrieval

About

Remote Sensing Image-Text Retrieval (RSITR) is pivotal for knowledge services and data mining in the remote sensing (RS) domain. Accounting for multi-scale representations in image content and text vocabulary lets models learn richer representations and improves retrieval. Current multi-scale RSITR approaches typically align multi-scale fused image features with text features, but overlook aligning image-text pairs at each scale separately. This oversight restricts their ability to learn joint representations suitable for effective retrieval. We introduce a novel Multi-Scale Alignment (MSA) method to overcome this limitation. Our method comprises three key innovations: (1) a Multi-scale Cross-Modal Alignment Transformer (MSCMAT), which computes cross-attention between single-scale image features and localized text features and integrates global textual context to derive a matching score matrix within a mini-batch, (2) a multi-scale cross-modal semantic alignment loss that enforces semantic alignment across scales, and (3) a cross-scale multi-modal semantic consistency loss that uses the matching matrix from the largest scale to guide alignment at smaller scales. We evaluated our method across multiple datasets, demonstrating its efficacy with various visual backbones and establishing its superiority over existing state-of-the-art methods. The GitHub URL for our project is: https://github.com/yr666666/MSA
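
To make the roles of the two losses concrete, the following is a minimal sketch of how per-scale matching matrices could be combined. It is not the authors' implementation: the abstract does not give the loss formulas, so the symmetric cross-entropy used for alignment, the row-wise KL divergence used for consistency, and the names `alignment_loss`, `consistency_loss`, `total_loss`, and `consistency_weight` are all assumptions made for illustration. Each matrix is assumed to be a B x B list of scores for one mini-batch, with matched image-text pairs on the diagonal.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of one row of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def alignment_loss(scores):
    """Assumed form: symmetric cross-entropy over a B x B matching matrix.

    Row i scores image i against every text in the batch; the matched
    text is assumed to sit at column i.
    """
    b = len(scores)
    columns = [list(col) for col in zip(*scores)]  # text-to-image direction
    image_to_text = -sum(log_softmax(scores[i])[i] for i in range(b)) / b
    text_to_image = -sum(log_softmax(columns[i])[i] for i in range(b)) / b
    return 0.5 * (image_to_text + text_to_image)


def consistency_loss(teacher_scores, student_scores):
    """Assumed form: mean row-wise KL(teacher || student)."""
    total = 0.0
    for t_row, s_row in zip(teacher_scores, student_scores):
        log_p = log_softmax(t_row)
        log_q = log_softmax(s_row)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(teacher_scores)


def total_loss(scores_by_scale, consistency_weight=1.0):
    """scores_by_scale: matching matrices ordered from smallest to largest scale.

    The largest-scale matrix acts as the guide for every smaller scale.
    """
    teacher = scores_by_scale[-1]
    align = sum(alignment_loss(s) for s in scores_by_scale)
    consist = sum(consistency_loss(teacher, s) for s in scores_by_scale[:-1])
    return align + consistency_weight * consist


# Hypothetical 2 x 2 matching matrices for two scales of one mini-batch.
small_scale = [[1.0, 0.5], [0.4, 0.9]]
large_scale = [[2.0, 0.1], [0.2, 1.8]]
loss = total_loss([small_scale, large_scale])
```

In a real training loop the teacher matrix would normally be excluded from gradient computation so that it only guides the smaller scales; that detail is omitted here because the sketch uses plain Python lists rather than a tensor library.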

Rui Yang, Shuang Wang, Yingping Han, Yuanheng Li, Dong Zhao, Dou Quan, Yanhe Guo, Licheng Jiao • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Image-Text Retrieval | RSICD | Mean Recall | 20.62 | 119 |
| Text-to-Image Retrieval | NWPU (test) | R@1 | 6.69 | 44 |
| Image-to-Text Retrieval | NWPU (test) | R@1 | 6.46 | 44 |
| Remote Sensing Image-Text Retrieval | RSICD (test) | Text Retrieval R@1 | 9.52 | 14 |
| Remote Sensing Image-Text Retrieval | RSITMD (test) | Text Retrieval R@1 | 15.93 | 14 |
| Image-to-Text Retrieval | RSITMD 20% noise ratio | R@1 | 12.88 | 11 |
| Text-to-Image Retrieval | RSITMD 20% noise ratio | R@1 | 10.11 | 11 |
| Text-to-Image Retrieval | RSITMD 40% noise ratio | R@1 | 5.24 | 11 |
| Text-to-Image Retrieval | RSITMD 60% noise ratio | R@1 | 1.68 | 11 |
| Text-to-Image Retrieval | RSITMD 80% noise ratio | R@1 | 0.86 | 11 |
Showing 10 of 13 rows
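
For readers unfamiliar with the metrics in the table, the sketch below shows one common way to compute Recall@K (R@K) and mean recall from a query-by-candidate similarity matrix. It is an illustration under stated assumptions, not the evaluation script behind the numbers above: the function names, the `ground_truth` format (one set of correct candidate indices per query), and the choice of averaging R@1, R@5, and R@10 over both retrieval directions are assumptions, although that averaging is a widely used convention in image-text retrieval.

```python
def recall_at_k(scores, ground_truth, k):
    """Percentage of queries with at least one correct candidate in the top k.

    scores[q][c] is the similarity between query q and candidate c.
    ground_truth[q] is the set of candidate indices that are correct for q.
    """
    hits = 0
    for q, row in enumerate(scores):
        # Rank candidate indices from most to least similar.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if ground_truth[q] & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(scores)


def mean_recall(i2t_scores, i2t_truth, t2i_scores, t2i_truth, ks=(1, 5, 10)):
    """Average of R@K over both retrieval directions and all K in ks."""
    values = [recall_at_k(i2t_scores, i2t_truth, k) for k in ks]
    values += [recall_at_k(t2i_scores, t2i_truth, k) for k in ks]
    return sum(values) / len(values)


# Hypothetical toy example: 3 queries, 3 candidates, one correct match each.
toy_scores = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.7, 0.2],
]
toy_truth = [{0}, {1}, {2}]
r_at_1 = recall_at_k(toy_scores, toy_truth, 1)
r_at_2 = recall_at_k(toy_scores, toy_truth, 2)
```

In the toy example only the first query ranks its correct candidate first, so R@1 is one third of the queries, while every query has its correct candidate within the top two.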
