
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

About

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
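The core idea — a small set of learnable query tokens that cross-attend to frozen image features, with the outputs projected into the frozen LLM's embedding space as a soft visual prompt — can be sketched in a few lines. The following is a simplified, hypothetical PyTorch illustration (module names, layer counts, and dimensions are illustrative, not the paper's actual Q-Former, which is a full BERT-style transformer with 32 queries):

```python
# Minimal sketch of the Q-Former idea behind BLIP-2 (simplified, hypothetical).
# Learnable query tokens cross-attend to features from a frozen image encoder;
# only the Q-Former and a linear projection would be trained, while the image
# encoder and the LLM stay frozen. All dimensions here are illustrative.
import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    def __init__(self, num_queries=32, dim=256, llm_dim=512):
        super().__init__()
        # Learnable queries that extract visual information.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        # Cross-attention: queries attend to the frozen image features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )
        # Projects query outputs into the frozen LLM's embedding space,
        # where they act as a soft visual prompt prepended to the text.
        self.to_llm = nn.Linear(dim, llm_dim)

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, dim) from a frozen encoder
        b = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        attn_out, _ = self.cross_attn(q, image_feats, image_feats)
        q = q + attn_out
        q = q + self.ffn(q)
        return self.to_llm(q)  # (batch, num_queries, llm_dim)

# Random features stand in for a frozen ViT's patch embeddings.
feats = torch.randn(2, 197, 256)
qformer = TinyQFormer()
visual_prompt = qformer(feats)
print(visual_prompt.shape)  # torch.Size([2, 32, 512])
```

Because the queries are few (32 in the paper) regardless of image resolution, the frozen LLM sees a fixed-length visual prefix, which is what keeps the trainable parameter count so far below end-to-end models like Flamingo.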

Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi • 2023

Related benchmarks

Task                                   Dataset                    Metric            Result   Rank
Visual Question Answering              VizWiz                     Accuracy          53.8     1525
Object Hallucination Evaluation        POPE                       Accuracy          85.5     1455
Visual Question Answering              VQA v2                     Accuracy          82.2     1362
Visual Question Answering              TextVQA                    Accuracy          60.7     1285
Visual Question Answering              GQA                        Accuracy          63.5     1249
Text-based Visual Question Answering   TextVQA                    Accuracy          42.5     807
Visual Question Answering              VQA v2 (test-dev)          Overall Accuracy  82.2     706
Image Captioning                       MS COCO Karpathy (test)    CIDEr             145.8    682
Multimodal Evaluation                  MME                        Score             1550     658
Multimodal Understanding               MMBench                    Accuracy          51.7     637

Showing 10 of 495 rows
