BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

About

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released at https://github.com/salesforce/BLIP.
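The caption-bootstrapping step described above (a captioner proposes synthetic captions, a filter discards noisy ones) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the released code: `Pair`, `bootstrap_captions`, the `captioner` and `match_score` callables, and the `threshold` value are stand-ins for trained models and tuned settings. The sketch also assumes the filter is applied to both the original web caption and the synthetic one.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Pair:
    """One image-text pair (hypothetical container for this sketch)."""
    image_id: str
    caption: str
    source: str  # "web" or "synthetic"


def bootstrap_captions(
    web_pairs: Iterable[Pair],
    captioner: Callable[[str], str],           # image_id -> synthetic caption
    match_score: Callable[[str, str], float],  # (image_id, caption) -> score in [0, 1]
    threshold: float = 0.5,                    # assumed cutoff, not from the paper
) -> List[Pair]:
    """Keep every web or synthetic caption that the filter judges to match its image."""
    kept: List[Pair] = []
    for pair in web_pairs:
        synthetic = Pair(pair.image_id, captioner(pair.image_id), "synthetic")
        for candidate in (pair, synthetic):
            if match_score(candidate.image_id, candidate.caption) >= threshold:
                kept.append(candidate)
    return kept


# Toy stand-ins so the sketch runs without any model weights.
def toy_captioner(image_id: str) -> str:
    return f"a photo of {image_id}"


def toy_score(image_id: str, caption: str) -> float:
    return 1.0 if image_id in caption else 0.0


web = [
    Pair("dog", "buy cheap shoes online", "web"),  # noisy web text
    Pair("cat", "my cat on the sofa", "web"),      # relevant web text
]
cleaned = bootstrap_captions(web, toy_captioner, toy_score)
```

With these toy functions, the noisy web caption for `dog` is dropped while its synthetic caption is kept, and both captions for `cat` survive. In a real pipeline the two callables would be learned models rather than string checks.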

Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi • 2022

Related benchmarks

Task                             Dataset                   Result                  Rank
Object Hallucination Evaluation  POPE                      Accuracy 73.4           2019
Visual Question Answering        VizWiz                    Accuracy 19.6           1820
Visual Question Answering        TextVQA                   Accuracy 42.5           1453
Visual Question Answering        GQA                       Accuracy 41             1425
Multimodal Understanding         MMBench                   Accuracy 22.4           847
Science Question Answering       ScienceQA                 Accuracy 68.02          791
Visual Question Answering        VQA v2 (test-dev)         Overall Accuracy 78.3   712
Image Captioning                 MS COCO Karpathy (test)   CIDEr 136.7             706
Multimodal Understanding         MM-Vet                    MM-Vet Score 46.4       631
Text-to-Image Retrieval          Flickr30K                 R@1 87.3                559
Showing 10 of 344 rows
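
The R@1 figure reported for retrieval is recall at rank 1: the share of queries whose correct item is ranked first. A minimal sketch of the metric follows. It assumes a square similarity matrix in which the correct candidate for query i sits at index i, a simplification chosen for illustration rather than the layout of any specific benchmark. The function name `recall_at_k` and the toy scores are hypothetical.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Fraction of queries whose ground-truth candidate is among the top-k scores.

    similarity[i][j] is the score of query i against candidate j.
    Assumption for this sketch: the correct candidate for query i is index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores: queries 0 and 2 rank their own candidate first, query 1 ranks it last.
toy_similarity = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.2, 0.7],
]
r_at_1 = recall_at_k(toy_similarity, k=1)
r_at_3 = recall_at_k(toy_similarity, k=3)
```

The function returns a fraction in [0, 1]; multiplying by 100 gives the percentage form used in the table.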
