
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

About

Since the resurgence of deep learning, vision-language models (VLMs) enhanced by large language models (LLMs) have grown exponentially in popularity. However, while LLMs can utilize extensive background knowledge and task information via in-context learning, most VLMs still struggle to understand complex multi-modal prompts containing multiple images, making them less effective on downstream vision-language tasks. In this paper, we address this limitation by 1) introducing MMICL, a vision-language Model with Multi-Modal In-Context Learning, a new approach that allows a VLM to handle multi-modal inputs efficiently; 2) proposing a novel context scheme to augment the in-context learning ability of the VLM; and 3) constructing the Multi-modal In-Context Learning (MIC) dataset, designed to enhance the VLM's ability to understand complex multi-modal prompts. Our experiments confirm that MMICL achieves new state-of-the-art zero-shot performance on a wide range of general vision-language tasks, especially on complex benchmarks including MME and MMBench. Our analysis demonstrates that MMICL effectively tackles the challenge of complex multi-modal prompt understanding and exhibits impressive in-context learning ability. Furthermore, we observe that MMICL successfully alleviates language bias in VLMs, a common issue that often leads to hallucination when the model faces extensive textual context. Our code, dataset, dataset tool, and model are available at https://github.com/PKUnlp-icler/MIC
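To make the idea of a multi-modal in-context prompt concrete, the sketch below builds an interleaved image-text prompt with a few demonstration examples followed by a query, in the spirit of the context scheme described above. The placeholder tokens (`[IMG{i}]`), the `build_icl_prompt` helper, and the exact layout are illustrative assumptions, not the actual format used by MMICL; see the MIC repository for the real prompt construction.

```python
# Hypothetical sketch of an interleaved multi-modal in-context prompt.
# The [IMGi] placeholder tokens and this layout are assumptions for
# illustration only; they are not MMICL's actual prompt format.

def build_icl_prompt(examples, query):
    """Interleave (image_ref, question, answer) demos with a final query.

    examples: list of (image_ref, question, answer) demonstration triples
    query:    (image_ref, question) pair to be answered by the model
    """
    parts = []
    for i, (img_ref, question, answer) in enumerate(examples):
        parts.append(f"Image {i}: [IMG{i}] {img_ref}")
        parts.append(f"Question: {question}")
        parts.append(f"Answer: {answer}")
    qi = len(examples)
    img_ref, question = query
    parts.append(f"Image {qi}: [IMG{qi}] {img_ref}")
    parts.append(f"Question: {question}")
    parts.append("Answer:")  # left open for the model to complete
    return "\n".join(parts)

demos = [("cat.jpg", "What animal is shown?", "A cat.")]
prompt = build_icl_prompt(demos, ("dog.jpg", "What animal is shown?"))
print(prompt)
```

In a real VLM pipeline, each `[IMGi]` placeholder would be replaced by the encoded visual features of the corresponding image before the sequence is passed to the language model.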

Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, Baobao Chang• 2023

Related benchmarks

Task                         Dataset             Metric           Result   Rank
Visual Question Answering    VizWiz              Accuracy         50.3     1525
Visual Question Answering    VQA v2              Accuracy         70.6     1362
Multimodal Evaluation        MME                 --               --       658
Video Question Answering     MSRVTT-QA           Accuracy         42.36    491
Video Question Answering     MSVD-QA             Accuracy         55.16    360
Multimodal Understanding     SEED-Bench          Accuracy         56.66    343
Science Question Answering   ScienceQA IMG       Accuracy         74.92    294
Object Hallucination         POPE (Adversarial)  Accuracy         80.97    288
Object Hallucination         POPE (Random)       F1 Score         86.62    285
Visual Question Answering    OKVQA               Top-1 Accuracy   72       283

Showing 10 of 37 rows.
