ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
About
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -- by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models -- achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.
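The core architectural idea above is the co-attentional transformer layer: each stream computes queries from its own features but attends over the keys and values of the other stream, so visual features are conditioned on language and vice versa. The following is a minimal NumPy sketch of that exchange, with single-head attention and the learned projection matrices omitted for brevity; the function names and dimensions are illustrative, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: one stream's queries attend
    # over the other stream's keys/values.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

def co_attention_block(visual, linguistic):
    # One co-attentional exchange (hypothetical simplification):
    # each stream is updated from the *other* stream's features.
    visual_out = cross_attention(visual, linguistic, linguistic)
    linguistic_out = cross_attention(linguistic, visual, visual)
    return visual_out, linguistic_out

rng = np.random.default_rng(0)
v = rng.normal(size=(36, 64))  # e.g. 36 image-region features
t = rng.normal(size=(20, 64))  # e.g. 20 token features
v_out, t_out = co_attention_block(v, t)
print(v_out.shape, t_out.shape)  # (36, 64) (20, 64)
```

Because each stream keeps its own parameters and only exchanges attention, the two modalities can be processed at different depths, which is the motivation for the two-stream design over a single joint transformer.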
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 70.6 | 706 |
| Natural Language Understanding | GLUE | SST-2 | 90.4 | 531 |
| Natural Language Understanding | GLUE (dev) | SST-2 (Acc) | 90.4 | 518 |
| Image-to-Text Retrieval | Flickr30K 1K (test) | R@1 | 58.2 | 491 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 70.92 | 486 |
| Text-to-Image Retrieval | Flickr30k (test) | Recall@1 | 58.2 | 445 |
| Text-to-Image Retrieval | Flickr30K 1K (test) | R@1 | 58.2 | 432 |
| Natural Language Understanding | GLUE (test) | SST-2 Accuracy | 90.3 | 416 |
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy | 72.34 | 354 |
| Natural Language Visual Reasoning | NLVR2 (test-p) | Accuracy | 67 | 346 |