How Much Can CLIP Benefit Vision-and-Language Tasks?

About

Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world. However, it has been observed that large-scale pretraining usually can result in better generalization performance, e.g., CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, has shown a strong zero-shot capability on various vision tasks. To further study the advantage brought by CLIP, we propose to use CLIP as the visual encoder in various V&L models in two typical scenarios: 1) plugging CLIP into task-specific fine-tuning; 2) combining CLIP with V&L pre-training and transferring to downstream tasks. We show that CLIP significantly outperforms widely-used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown. We achieve competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks. We release our code at https://github.com/clip-vil/CLIP-ViL.

Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer • 2021

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Image Captioning | MS COCO Karpathy (test) | CIDEr | 1.342 | 682 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 76.5 | 664 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 76.94 | 466 |
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy | 76.48 | 337 |
| Vision-and-Language Navigation | R2R (val unseen) | Success Rate (SR) | 59.2 | 260 |
| Visual Entailment | SNLI-VE (test) | Overall Accuracy | 80.2 | 197 |
| Visual Question Answering | GQA (test-dev) | Accuracy | 61.42 | 178 |
| Vision-Language Navigation | RxR-CE (val-unseen) | SR | 42.6 | 172 |
| Visual Question Answering | VQA 2.0 (val) | Accuracy (Overall) | 23.07 | 143 |
| Vision-Language Navigation | R2R (test unseen) | SR | 59 | 122 |
Showing 10 of 26 rows
