BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

About

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released at https://github.com/salesforce/BLIP.

Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi • 2022
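
The caption bootstrapping described in the abstract (a captioner proposes synthetic captions, a filter discards noisy ones) can be illustrated with a minimal sketch. This is not the released BLIP code: the function name `bootstrap_captions` and the `captioner` and `keep` callables are hypothetical stand-ins for the trained models, and applying the filter to both the web caption and the synthetic caption is an assumption of this sketch.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (image identifier, caption)


def bootstrap_captions(
    web_pairs: Iterable[Pair],
    captioner: Callable[[str], str],   # hypothetical: image -> synthetic caption
    keep: Callable[[str, str], bool],  # hypothetical: (image, caption) -> accept?
) -> List[Pair]:
    """Return the (image, caption) pairs that survive filtering.

    For each web image, the captioner proposes one synthetic caption.
    The filter then judges the web caption and the synthetic caption
    (an assumption of this sketch), and only accepted captions are kept.
    """
    cleaned: List[Pair] = []
    for image, web_caption in web_pairs:
        # dict.fromkeys drops an exact duplicate while preserving order.
        candidates = dict.fromkeys([web_caption, captioner(image)])
        for caption in candidates:
            if keep(image, caption):
                cleaned.append((image, caption))
    return cleaned


if __name__ == "__main__":
    # Toy stand-ins: a lookup table plays the role of both models.
    truth = {"img1": "a dog on a beach", "img2": "a red car"}
    web = [("img1", "buy cheap flights now"), ("img2", "a red car")]
    result = bootstrap_captions(
        web,
        captioner=lambda image: truth[image],
        keep=lambda image, caption: caption == truth[image],
    )
    print(result)
```

In the toy run, the noisy web caption for `img1` is replaced by the synthetic one, while the already clean caption for `img2` is kept once.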

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 73.4	2056
Visual Question Answering	VizWiz	Accuracy 19.6	1863
Visual Question Answering	TextVQA	Accuracy 42.5	1455
Visual Question Answering	GQA	Accuracy 41	1445
Science Question Answering	ScienceQA	Accuracy 68.02	916
Multimodal Understanding	MMBench	Accuracy 22.4	887
Visual Question Answering	VQA v2 (test-dev)	Overall Accuracy 78.3	721
Image Captioning	MS COCO Karpathy (test)	CIDEr 136.7	706
Semantic Segmentation	ADE20K	mIoU 46.9	699
Multimodal Understanding	MM-Vet	MM-Vet Score 46.4	664

Showing 10 of 362 rows
