Cross-Modal Adapter for Vision-Language Retrieval

About

Vision-language retrieval is an important multi-modal learning topic, where the goal is to retrieve the most relevant visual candidate for a given text query. Recently, pre-trained models such as CLIP have shown great potential on retrieval tasks. However, as pre-trained models scale up, fully fine-tuning them on downstream retrieval datasets carries a high risk of overfitting. Moreover, in practice, it is costly to train and store a large model for each task. To overcome these issues, we present a novel Cross-Modal Adapter for parameter-efficient transfer learning. Inspired by adapter-based methods, we adjust the pre-trained model with a few parameterization layers. However, there are two notable differences. First, our method is designed for the multi-modal domain. Second, it allows encoder-level implicit cross-modal interactions between the vision and language encoders. Although surprisingly simple, our approach has three notable benefits: it (1) reduces the vast majority of fine-tuned parameters, (2) saves training time, and (3) allows all the pre-trained parameters to be fixed, enabling the pre-trained model to be shared across datasets. Extensive experiments demonstrate that, without bells and whistles, our approach outperforms adapter-based methods on image-text retrieval datasets (MSCOCO, Flickr30K) and video-text retrieval datasets (MSR-VTT, DiDeMo, and ActivityNet).
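To make the general adapter idea concrete, here is a minimal pure-Python sketch of a residual bottleneck adapter attached to a frozen layer. It illustrates only the generic "few trainable layers, frozen backbone" pattern; it does not reproduce the Cross-Modal Adapter itself, and in particular it does not model the cross-modal interaction, whose mechanism the abstract does not detail. The class name `BottleneckAdapter`, the helper `matvec`, and the sizes `dim = 64` and `bottleneck = 4` are all hypothetical choices made for illustration.

```python
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class BottleneckAdapter:
    """Hypothetical residual adapter: x + W_up @ relu(W_down @ x)."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection (bottleneck x dim), small random init.
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        # Up-projection (dim x bottleneck), zero init so the adapter starts
        # as the identity and leaves the frozen model's output intact.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.w_down, x)]
        update = matvec(self.w_up, hidden)
        return [xi + ui for xi, ui in zip(x, update)]

    def num_params(self):
        return sum(len(r) for r in self.w_down) + sum(len(r) for r in self.w_up)


dim, bottleneck = 64, 4        # illustrative sizes, not taken from the paper
frozen_params = dim * dim      # one frozen dim x dim projection, never updated
adapter = BottleneckAdapter(dim, bottleneck)
trainable_fraction = adapter.num_params() / (frozen_params + adapter.num_params())

x = [0.5] * dim
y = adapter(x)                 # equals x at initialization because w_up is zero
```

Because only `w_down` and `w_up` would be trained, the stored per-task state is the adapter alone, while the frozen weights can be shared across datasets, which is the storage benefit the abstract describes.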

Haojun Jiang, Jianke Zhang, Rui Huang, Chunjiang Ge, Zanlin Ni, Shiji Song, Gao Huang • 2022

Related benchmarks

Task                      Dataset              Result                       Rank
Text-to-Video Retrieval   MSVD (test)          R@1: 47.4                    204
Text-to-Video Retrieval   VATEX (test)         R@1: 59.3                    62
Image-to-Text Retrieval   RSITMD (test)        R@1: 18.16                   61
Text-to-Image Retrieval   RSITMD (test)        R@1: 16.31                   61
Video-to-Text retrieval   MSVD (test)          R@1: 63.6                    61
Text Retrieval            RSICD (test)         R@1: 11.18                   51
Text-to-Video Retrieval   MSR-VTT 1K (test)    R@1: 45.4                    45
Image-Text Retrieval      RSICD (test)         mR: 19.61                    43
Video-to-Text retrieval   MSR-VTT 1K (test)    R@1: 46.2                    39
Cross-modal retrieval     RSICD (test)         Image-to-Text R@1: 11.18     32
Showing 10 of 15 rows
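
Most results above are reported as R@1 (Recall@1, in percent): the share of queries whose correct candidate is ranked first. The sketch below shows one way such a score could be computed from a query-by-candidate similarity matrix. It assumes exactly one correct candidate per query, located at the same index as the query; real benchmarks with several captions per image or video need a different ground-truth mapping. The function name `recall_at_k` and the toy similarity values are hypothetical.

```python
def recall_at_k(similarity, k=1):
    """Percentage of queries whose matching candidate is in the top k.

    similarity[i][j] is the score of query i against candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3 x 3 similarity matrix (made-up scores).
sim = [
    [0.9, 0.1, 0.2],   # query 0: correct candidate ranked first
    [0.3, 0.5, 0.8],   # query 1: correct candidate ranked second
    [0.1, 0.4, 0.7],   # query 2: correct candidate ranked first
]
r1 = recall_at_k(sim, k=1)   # 2 of 3 queries hit
r2 = recall_at_k(sim, k=2)   # all 3 queries hit
```

For text-to-video retrieval the rows would be text queries and the columns videos; for the video-to-text direction the matrix is transposed.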
