ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
About
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -- by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models -- achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.
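The core architectural idea above is the co-attentional transformer layer: each stream computes queries from its own features but attends over the keys and values of the other stream, so visual features are conditioned on language and vice versa. The following is a minimal NumPy sketch of that exchange, with single-head attention and the learned projection matrices omitted for brevity; the function names and dimensions are illustrative, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: one stream's queries attend
    # over the other stream's keys/values.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

def co_attention_block(visual, linguistic):
    # One co-attentional exchange (hypothetical simplification):
    # each stream is updated from the *other* stream's features.
    visual_out = cross_attention(visual, linguistic, linguistic)
    linguistic_out = cross_attention(linguistic, visual, visual)
    return visual_out, linguistic_out

rng = np.random.default_rng(0)
v = rng.normal(size=(36, 64))  # e.g. 36 image-region features
t = rng.normal(size=(20, 64))  # e.g. 20 token features
v_out, t_out = co_attention_block(v, t)
print(v_out.shape, t_out.shape)  # (36, 64) (20, 64)
```

Because each stream keeps its own parameters and only exchanges attention, the two modalities can be processed at different depths, which is the motivation for the two-stream design over a single joint transformer.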
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 70.6 | 706 |
| Natural Language Understanding | GLUE | SST-2 | 90.4 | 531 |
| Natural Language Understanding | GLUE (dev) | SST-2 (Acc) | 90.4 | 518 |
| Image-to-Text Retrieval | Flickr30K 1K (test) | R@1 | 58.2 | 491 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 70.92 | 486 |
| Text-to-Image Retrieval | Flickr30k (test) | Recall@1 | 58.2 | 445 |
| Text-to-Image Retrieval | Flickr30K 1K (test) | R@1 | 58.2 | 432 |
| Natural Language Understanding | GLUE (test) | SST-2 Accuracy | 90.3 | 416 |
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy | 72.34 | 354 |
| Natural Language Visual Reasoning | NLVR2 (test-p) | Accuracy | 67 | 346 |