UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling

About

Large-scale vision-language pre-trained models have shown promising transferability to various downstream tasks. As the size of these foundation models and the number of downstream tasks grow, the standard full fine-tuning paradigm becomes unsustainable due to heavy computational and storage costs. This paper proposes UniAdapter, which unifies unimodal and multimodal adapters for parameter-efficient cross-modal adaptation on pre-trained vision-language models. Specifically, adapters are distributed to different modalities and their interactions, with the total number of tunable parameters reduced by partial weight sharing. The unified and knowledge-sharing design enables powerful cross-modal representations that can benefit various downstream tasks, requiring only 1.0%-2.0% tunable parameters of the pre-trained model. Extensive experiments on 6 cross-modal downstream benchmarks (including video-text retrieval, image-text retrieval, VideoQA, and VQA) show that in most cases, UniAdapter not only outperforms the state of the art, but even beats the full fine-tuning strategy. Particularly, on the MSRVTT retrieval task, UniAdapter achieves 49.7% recall@1 with 2.2% model parameters, outperforming the latest competitors by 2.0%. The code and models are available at https://github.com/RERV/UniAdapter.

Haoyu Lu, Yuqi Huo, Guoxing Yang, Zhiwu Lu, Wei Zhan, Masayoshi Tomizuka, Mingyu Ding • 2023
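
The abstract says that partial weight sharing reduces the number of tunable parameters. The arithmetic behind that claim can be sketched as below. This is only an illustration under stated assumptions, not the authors' implementation: it assumes a standard bottleneck adapter (a down-projection followed by an up-projection, each with a bias), assumes the down-projection is the shared part, and uses hypothetical sizes (hidden width 768, bottleneck 128, three adapters for the visual, textual, and cross-modal paths). See the linked repository for the actual design.

```python
# Illustrative parameter count for bottleneck adapters, with and without
# a shared down-projection. All sizes below are hypothetical assumptions.

def adapter_params(d_model: int, bottleneck: int) -> tuple:
    """Return (down, up) parameter counts for one bottleneck adapter."""
    down = d_model * bottleneck + bottleneck  # weight matrix + bias
    up = bottleneck * d_model + d_model       # weight matrix + bias
    return down, up


def total_params(d_model: int, bottleneck: int, n_adapters: int,
                 share_down: bool) -> int:
    """Total tunable parameters for n_adapters adapters in one layer."""
    down, up = adapter_params(d_model, bottleneck)
    # With sharing, a single down-projection serves every adapter;
    # each adapter still keeps its own up-projection.
    n_down = 1 if share_down else n_adapters
    return n_down * down + n_adapters * up


D_MODEL = 768      # assumed hidden width
BOTTLENECK = 128   # assumed adapter bottleneck size
N_ADAPTERS = 3     # assumed: visual, textual, cross-modal

separate = total_params(D_MODEL, BOTTLENECK, N_ADAPTERS, share_down=False)
shared = total_params(D_MODEL, BOTTLENECK, N_ADAPTERS, share_down=True)

print(f"separate adapters:     {separate} parameters")
print(f"shared down-projection: {shared} parameters")
print(f"reduction:              {1 - shared / separate:.1%}")
```

Under these assumed sizes, sharing one projection across three adapters removes about a third of the adapter parameters per layer. The real saving depends on the model's actual dimensions and on which weights are shared.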

Related benchmarks

Task                       Dataset            Result                  Rank
Visual Question Answering  VQA v2 (test-dev)  Overall Accuracy 75.44  664
Visual Question Answering  VQA v2 (test-std)  Accuracy 75.56          466
Text-to-Image Retrieval    Flickr30K          R@1 86.5                460
Text-to-Image Retrieval    Flickr30k (test)   Recall@1 83.6           423
Image-to-Text Retrieval    Flickr30K          R@1 97.1                379
Video Question Answering   MSRVTT-QA (test)   Accuracy 44.7           371
Image-to-Text Retrieval    Flickr30k (test)   R@1 94.2                370
Video-to-Text Retrieval    MSR-VTT            Recall@1 50.6           157
Image-to-Text Retrieval    MSCOCO             R@1 80.1                124
Text-to-Image Retrieval    MSCOCO             R@1 62.6                118
Showing 10 of 24 rows
