Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval

About

Current image-text retrieval methods have demonstrated impressive performance in recent years. However, they still face two problems: the inter-modal matching missing problem and the intra-modal semantic loss problem. Both can significantly reduce the accuracy of image-text retrieval. To address these challenges, we propose a novel method called Cross-modal and Uni-modal Soft-label Alignment (CUSA). Our method leverages uni-modal pre-trained models to provide soft-label supervision signals for the image-text retrieval model. Additionally, we introduce two alignment techniques, Cross-modal Soft-label Alignment (CSA) and Uni-modal Soft-label Alignment (USA), to overcome false negatives and enhance similarity recognition between uni-modal samples. Our method is designed to be plug-and-play, meaning it can be easily applied to existing image-text retrieval models without changing their original architectures. Through extensive experiments on various image-text retrieval models and datasets, we demonstrate that our method consistently improves the performance of image-text retrieval and achieves new state-of-the-art results. Furthermore, our method also boosts the uni-modal retrieval performance of image-text retrieval models, enabling them to achieve universal retrieval. The code and supplementary files can be found at https://github.com/lerogo/aaai24_itr_cusa.
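The abstract does not spell out the loss behind soft-label alignment. As a rough illustration only, one common way to use a teacher's similarity scores as soft labels is to turn each row of similarities into a probability distribution and penalize the KL divergence between the teacher's and the student's distributions. The sketch below assumes that formulation; the function names, the temperature parameter, and the list-of-lists layout are hypothetical and are not taken from the paper or its repository.

```python
import math


def softmax(row, temperature=1.0):
    """Turn a row of similarity scores into a probability distribution."""
    scaled = [s / temperature for s in row]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def soft_label_alignment_loss(teacher_sims, student_sims, temperature=1.0):
    """Mean row-wise KL divergence between teacher and student similarities.

    teacher_sims: similarities from a frozen uni-modal model (the soft labels).
    student_sims: similarities from the retrieval model being trained.
    Both are assumed to be same-shaped lists of lists over one batch.
    """
    losses = []
    for t_row, s_row in zip(teacher_sims, student_sims):
        p = softmax(t_row, temperature)
        q = softmax(s_row, temperature)
        losses.append(kl_divergence(p, q))
    return sum(losses) / len(losses)


# Toy batch of three samples (made-up numbers, for illustration only).
teacher = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]]
student = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
loss = soft_label_alignment_loss(teacher, student, temperature=0.5)
```

In this toy batch the teacher rates samples 0 and 1 as similar, while the student treats them as unrelated, so the loss is positive. A hard one-hot label would instead treat that pair as a plain negative, which is the false-negative situation the abstract describes.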

Hailang Huang, Zhijie Nie, Ziqiao Wang, Ziyu Shang • 2024

Related benchmarks

Task                      Dataset                Result      Rank
Image-to-Text Retrieval   Flickr30K 1K (test)    R@1 81      491
Text-to-Image Retrieval   Flickr30K 1K (test)    R@1 90.8    432
Image-to-Text Retrieval   MS-COCO 5K (test)      R@1 57.4    320
Text-to-Image Retrieval   MSCOCO 5K (test)       R@1 67.9    308
Text-to-Image Retrieval   MS-COCO 5K (test)      R@1 44.3    244
Image-to-Text Retrieval   MSCOCO 5K (test)       R@1 52.4    64
Image-to-Text Retrieval   CamoIT 3K (test)       R@1 15.1    15
Text-to-Image Retrieval   CamoIT 3K (test)       R@1 13.5    15
Image-to-Text Retrieval   CamoIT (test)          R@1 23.9    10
Text-to-Image Retrieval   CamoIT (test)          R@1 23.5    10
Showing 10 of 12 rows
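
The results above are reported as R@1 (Recall@1), which is conventionally the percentage of queries whose top-ranked result is a correct match. This page does not describe the evaluation code, so the following is only a minimal sketch of the standard definition. The function name and the input layout (a score matrix plus one set of correct candidate indices per query) are assumptions made for illustration.

```python
def recall_at_k(similarity, ground_truth, k=1):
    """Percentage of queries whose top-k candidates contain a correct item.

    similarity[i][j]: score between query i and candidate j.
    ground_truth[i]: set of candidate indices that are correct for query i.
    """
    hits = 0
    for scores, relevant in zip(similarity, ground_truth):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if any(j in relevant for j in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example with three queries and three candidates (made-up scores).
toy_similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.6],
]
toy_ground_truth = [{0}, {1}, {2}]
r_at_1 = recall_at_k(toy_similarity, toy_ground_truth, k=1)
r_at_2 = recall_at_k(toy_similarity, toy_ground_truth, k=2)
```

Using a set per query allows several correct candidates, as in image-to-text retrieval on Flickr30K and MS-COCO, where each image is paired with five captions.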
