BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

About

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.

Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi • 2023
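
The abstract describes a lightweight Querying Transformer that sits between a frozen image encoder and a frozen language model. The toy sketch below illustrates one way such a bridge could work, under stated assumptions: a small set of trainable query vectors cross-attends to the image features and is then projected to the language model's input width. The abstract does not specify the internals, so the single-head attention, the sizes, and every name here (`cross_attend`, `bridge`, `NUM_QUERIES`, `FEAT_DIM`, `LLM_DIM`) are hypothetical, and the random matrices merely stand in for a real encoder's output.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    """Small random matrix used as a stand-in for weights or features."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def cross_attend(queries, image_feats):
    """Single-head cross-attention with no learned key/value projections:
    each query returns a softmax-weighted average of the image features."""
    scale = math.sqrt(len(queries[0]))
    keys_t = [list(col) for col in zip(*image_feats)]  # transpose to (dim, patches)
    scores = matmul(queries, keys_t)                   # (queries, patches)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, image_feats)                # (queries, dim)

def bridge(image_feats, queries, proj):
    """Map any number of image features to a fixed number of LLM-width tokens."""
    return matmul(cross_attend(queries, image_feats), proj)

rng = random.Random(0)
NUM_QUERIES, FEAT_DIM, LLM_DIM = 4, 8, 16            # toy sizes (assumptions)
queries = random_matrix(NUM_QUERIES, FEAT_DIM, rng)  # would be trainable
proj = random_matrix(FEAT_DIM, LLM_DIM, rng)         # would be trainable
image_small = random_matrix(10, FEAT_DIM, rng)       # stand-in for frozen encoder output
image_large = random_matrix(50, FEAT_DIM, rng)       # more patches, same feature width

out_small = bridge(image_small, queries, proj)
out_large = bridge(image_large, queries, proj)
print(len(out_small), len(out_small[0]))  # 4 16
print(len(out_large), len(out_large[0]))  # 4 16
```

In this sketch only `queries` and `proj` would receive gradients, and the output shape depends on the number of queries rather than on the number of image patches. That is one plausible reading of how a small bridging module can keep the trainable parameter count low while both large models stay frozen.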

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 85.5	2019
Visual Question Answering	VizWiz	Accuracy 53.8	1820
Visual Question Answering	TextVQA	Accuracy 60.7	1453
Visual Question Answering	VQA v2	Accuracy 82.2	1429
Visual Question Answering	GQA	Accuracy 63.5	1425
Text-based Visual Question Answering	TextVQA	Accuracy 42.5	962
Multimodal Understanding	MMBench	Accuracy 51.7	847
Science Question Answering	ScienceQA	Accuracy 70.45	791
Multimodal Evaluation	MME	Score 1.55e+3	727
Visual Question Answering	VQA v2 (test-dev)	Overall Accuracy 82.2	712

Showing 10 of 546 rows
