
VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks

About

Recently, fine-tuning language models pre-trained on large text corpora has provided huge improvements on vision-and-language (V&L) tasks as well as on pure language tasks. However, fine-tuning the entire parameter set of pre-trained models becomes impractical as model sizes grow rapidly. Hence, in this paper, we introduce adapter-based parameter-efficient transfer learning techniques to V&L models such as VL-BART and VL-T5. We evaluate our methods in a unified multi-task setup on both image-text and video-text benchmarks. For the image-text tasks, we use four diverse V&L datasets: VQAv2, GQA, NLVR2, and MSCOCO image captioning. For video-text tasks, we use TVQA, How2QA, TVC, and YC2C. With careful training and thorough experiments, we benchmark three popular adapter-based methods (Adapter, Hyperformer, Compacter) against standard full fine-tuning and the recently proposed prompt-tuning approach. We also enhance the efficiency and performance of adapters by sharing their weights to attain knowledge across tasks. Our results demonstrate that training the adapter with the weight-sharing technique (4.18% of total parameters for image-text tasks and 3.39% for video-text tasks) can match the performance of fine-tuning the entire model. Lastly, we present a comprehensive analysis, including the combination of adapters with task-specific prompts and the impact of V&L pre-training on adapters. Our code is available at: https://github.com/ylsung/VL_adapter.

Yi-Lin Sung, Jaemin Cho, Mohit Bansal • 2021
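
The abstract's point about weight sharing can be illustrated with a rough parameter count. The sketch below assumes a generic bottleneck adapter (a down-projection followed by an up-projection, each with a bias) and compares one adapter set shared by all tasks against a separate set per task. The function names and every size used here (backbone size, hidden width, bottleneck width, number of adapter slots) are hypothetical placeholders rather than the paper's configuration, so the resulting fractions are not expected to reproduce the 4.18% or 3.39% figures.

```python
# Illustrative sketch only: all names and sizes are hypothetical,
# not taken from the paper or its released code.

def adapter_params(d_model: int, bottleneck: int) -> int:
    """Parameter count of one bottleneck adapter (down- and up-projection with biases)."""
    down = d_model * bottleneck + bottleneck   # weights + bias of the down-projection
    up = bottleneck * d_model + d_model        # weights + bias of the up-projection
    return down + up


def trainable_fraction(backbone_params: int, d_model: int, bottleneck: int,
                       adapters_per_model: int, num_tasks: int,
                       share_across_tasks: bool) -> float:
    """Fraction of all parameters that are trained when the backbone is frozen."""
    per_set = adapters_per_model * adapter_params(d_model, bottleneck)
    # A shared set is stored once; otherwise each task keeps its own copy.
    copies = 1 if share_across_tasks else num_tasks
    trainable = per_set * copies
    return trainable / (backbone_params + trainable)


# Hypothetical configuration: a 220M-parameter backbone, hidden width 768,
# bottleneck width 96, 24 adapter slots, and four tasks.
config = dict(backbone_params=220_000_000, d_model=768, bottleneck=96,
              adapters_per_model=24, num_tasks=4)

shared = trainable_fraction(share_across_tasks=True, **config)
separate = trainable_fraction(share_across_tasks=False, **config)
print(f"shared adapters:   {shared:.2%} of total parameters")
print(f"per-task adapters: {separate:.2%} of total parameters")
```

Under these assumptions the shared setup trains a smaller fraction of the parameters, because the adapter set is stored once instead of once per task.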

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | COCO 2017 (val) | -- | -- | 2454 |
| Instance Segmentation | COCO 2017 (val) | -- | -- | 1144 |
| Visual Question Answering | VQA (test-dev) | Acc (All) | 68.1 | 147 |
| Visual Question Answering | VQA (test-std) | Accuracy | 68.3 | 110 |
| Visual Question Answering | OKVQA (val) | VQA Score | 34.87 | 101 |
| Multi-Task Adaptation | Pascal Context (test) | Seg Acc | 70.21 | 70 |
| Visual Question Answering | GQA (test-std) | Accuracy | 50.9 | 62 |
| Saliency Detection | Pascal Context (test) | -- | -- | 57 |
| Surface Normal Estimation | Pascal Context (test) | -- | -- | 50 |
| Multi-task Learning | Pascal Context | mIoU (Semantic Segmentation) | 70.21 | 47 |
Showing 10 of 17 rows
