
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

About

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -- by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models -- achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.

Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee • 2019
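
To make the co-attentional transformer layers concrete, below is a minimal PyTorch sketch of one such layer. The `CoAttentionLayer` class, its dimensions, and hyperparameters are illustrative assumptions rather than the paper's exact implementation; the key idea it shows is the query/key-value swap, where each stream's queries attend over the other stream's keys and values, which is how the two streams interact.

```python
# Minimal sketch of a co-attentional transformer layer in the spirit of
# ViLBERT. Dimensions and layer sizes are illustrative assumptions, not
# the paper's exact configuration.
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        # Cross-attention: queries from one stream, keys/values from the other.
        self.vis_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_norm = nn.LayerNorm(dim)
        self.txt_norm = nn.LayerNorm(dim)
        # Per-stream feed-forward blocks, as in a standard transformer layer.
        self.vis_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                     nn.Linear(4 * dim, dim))
        self.txt_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                     nn.Linear(4 * dim, dim))
        self.vis_norm2 = nn.LayerNorm(dim)
        self.txt_norm2 = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor):
        # Visual queries attend to linguistic keys/values, and vice versa.
        v, _ = self.vis_attn(query=vis, key=txt, value=txt)
        t, _ = self.txt_attn(query=txt, key=vis, value=vis)
        vis = self.vis_norm(vis + v)
        txt = self.txt_norm(txt + t)
        vis = self.vis_norm2(vis + self.vis_ffn(vis))
        txt = self.txt_norm2(txt + self.txt_ffn(txt))
        return vis, txt

# Example: 36 image-region features and 20 token embeddings, batch of 2.
vis = torch.randn(2, 36, 768)   # visual stream
txt = torch.randn(2, 20, 768)   # linguistic stream
vis_out, txt_out = CoAttentionLayer()(vis, txt)
print(vis_out.shape, txt_out.shape)  # (2, 36, 768) and (2, 20, 768)
```

In the full model, layers like this are interleaved with ordinary within-stream transformer blocks, keeping separate parameters for the visual and linguistic streams; this sketch omits attention masks and dropout for brevity.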

Related benchmarks

Task | Dataset | Metric | Result | Rank
Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 70.6 | 664
Natural Language Understanding | GLUE (dev) | SST-2 Accuracy | 90.4 | 504
Visual Question Answering | VQA v2 (test-std) | Accuracy | 70.92 | 466
Natural Language Understanding | GLUE | SST-2 Accuracy | 90.4 | 452
Image-to-Text Retrieval | Flickr30K 1K (test) | R@1 | 58.2 | 439
Text-to-Image Retrieval | Flickr30K (test) | R@1 | 58.2 | 423
Natural Language Understanding | GLUE (test) | SST-2 Accuracy | 90.3 | 416
Text-to-Image Retrieval | Flickr30K 1K (test) | R@1 | 58.2 | 375
Referring Expression Comprehension | RefCOCO+ (val) | Accuracy | 72.34 | 345
Visual Question Answering | VQA 2.0 (test-dev) | Accuracy | 71.16 | 337
Showing 10 of 119 rows.
