ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
About
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -- by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models -- achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.
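The core architectural idea above is the co-attentional transformer layer: each stream computes queries from its own features but keys and values from the other stream, so image regions attend to tokens and vice versa. The following NumPy sketch illustrates only this exchange; the function names are ours, and the learned projection matrices, multi-head splitting, residual connections, and feed-forward sublayers of the actual model are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # scaled dot-product attention; queries come from one stream,
    # keys/values from the other
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

def co_attention(visual, textual):
    # one co-attentional exchange: each stream attends to the other,
    # keeping its own sequence length and hidden size
    new_visual = cross_attention(visual, textual, textual)
    new_textual = cross_attention(textual, visual, visual)
    return new_visual, new_textual

# toy example: 3 image-region features and 5 token features, hidden size 8
rng = np.random.default_rng(0)
v = rng.standard_normal((3, 8))
t = rng.standard_normal((5, 8))
nv, nt = co_attention(v, t)
print(nv.shape, nt.shape)  # (3, 8) (5, 8)
```

Because each stream only borrows keys and values, the output shapes match the inputs, which lets these layers be stacked with ordinary self-attention layers in each stream.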
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 70.6 | 664 |
| Natural Language Understanding | GLUE (dev) | SST-2 (Acc) | 90.4 | 504 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 70.92 | 466 |
| Natural Language Understanding | GLUE | SST-2 | 90.4 | 452 |
| Image-to-Text Retrieval | Flickr30K 1K (test) | R@1 | 58.2 | 439 |
| Text-to-Image Retrieval | Flickr30k (test) | Recall@1 | 58.2 | 423 |
| Natural Language Understanding | GLUE (test) | SST-2 Accuracy | 90.3 | 416 |
| Text-to-Image Retrieval | Flickr30K 1K (test) | R@1 | 58.2 | 375 |
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy | 72.34 | 345 |
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy | 71.16 | 337 |