How Much Can CLIP Benefit Vision-and-Language Tasks?
About
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders that use a relatively small set of manually-annotated data (compared to web-crawled data) to perceive the visual world. However, large-scale pre-training can often yield better generalization: for example, CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, has shown strong zero-shot capability on various vision tasks. To further study the advantage brought by CLIP, we propose to use CLIP as the visual encoder in various V&L models in two typical scenarios: 1) plugging CLIP into task-specific fine-tuning; 2) combining CLIP with V&L pre-training and transferring to downstream tasks. We show that CLIP significantly outperforms widely-used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown. We achieve competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks. We release our code at https://github.com/clip-vil/CLIP-ViL.
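The first scenario above, plugging CLIP in as the visual encoder, can be sketched in PyTorch. This is a minimal illustration, not the repository's implementation: `GridFeatureEncoder` is a hypothetical stand-in for CLIP's visual backbone (in practice one would load pre-trained CLIP weights), and `VQAHead` is a toy fusion head; the key idea it shows is replacing region features from a detector with a dense grid of visual features fed to the downstream V&L model.

```python
import torch
import torch.nn as nn


class GridFeatureEncoder(nn.Module):
    """Stand-in for CLIP's visual backbone (a ResNet or ViT in practice).
    A single strided conv produces the same kind of (H/32 x W/32) grid of
    d-dimensional features that CLIP-style encoders expose."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=32, stride=32)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, H, W) -> grid features (B, N, d), N = H/32 * W/32
        feats = self.patchify(images)            # (B, d, H/32, W/32)
        return feats.flatten(2).transpose(1, 2)  # (B, N, d)


class VQAHead(nn.Module):
    """Toy fusion head: mean-pool the visual grid, concatenate with a
    question embedding, and classify over a fixed answer vocabulary."""

    def __init__(self, dim: int = 768, q_dim: int = 768, n_answers: int = 3129):
        super().__init__()
        self.classifier = nn.Linear(dim + q_dim, n_answers)

    def forward(self, grid: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        pooled = grid.mean(dim=1)                          # (B, d)
        return self.classifier(torch.cat([pooled, question], dim=-1))


encoder, head = GridFeatureEncoder(), VQAHead()
images = torch.randn(2, 3, 224, 224)      # a batch of two images
question = torch.randn(2, 768)            # placeholder question embeddings
logits = head(encoder(images), question)  # (2, 3129) answer logits
```

Fine-tuning then amounts to training (or jointly fine-tuning) the encoder and head on task data; the paper's point is that initializing the visual side from CLIP outperforms detectors trained on in-domain annotations.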
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Captioning | MS COCO Karpathy (test) | CIDEr | 1.342 | 682 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 76.5 | 664 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 76.94 | 466 |
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy | 76.48 | 337 |
| Vision-and-Language Navigation | R2R (val unseen) | Success Rate (SR) | 59.2 | 260 |
| Visual Entailment | SNLI-VE (test) | Overall Accuracy | 80.2 | 197 |
| Visual Question Answering | GQA (test-dev) | Accuracy | 61.42 | 178 |
| Vision-and-Language Navigation | RxR-CE (val-unseen) | Success Rate (SR) | 42.6 | 172 |
| Visual Question Answering | VQA 2.0 (val) | Overall Accuracy | 23.07 | 143 |
| Vision-and-Language Navigation | R2R (test unseen) | Success Rate (SR) | 59 | 122 |