Prismer: A Vision-Language Model with Multi-Task Experts
About
Recent vision-language models have shown impressive multi-modal generation capabilities. However, they typically require training huge models on massive datasets. As a more scalable alternative, we introduce Prismer, a data- and parameter-efficient vision-language model that leverages an ensemble of task-specific experts. Prismer only requires training of a small number of components, with the majority of network weights inherited from multiple readily available, pre-trained experts and kept frozen during training. By leveraging experts from a wide range of domains, we show that Prismer can efficiently pool this expert knowledge and adapt it to various vision-language reasoning tasks. In our experiments, we show that Prismer achieves fine-tuned and few-shot learning performance competitive with the current state of the art, while requiring up to two orders of magnitude less training data. Code is available at https://github.com/NVlabs/prismer.
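The training recipe described above — most weights inherited from frozen, pre-trained experts, with only a small set of new components updated — can be sketched as follows. This is a minimal, illustrative toy in NumPy, not the actual Prismer code: the expert matrices stand in for pre-trained encoders (e.g. depth or segmentation experts), and `adapter` stands in for the small trainable fusion components.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Experts": stand-ins for frozen pre-trained feature extractors.
# These weights are inherited, never updated during training.
frozen_experts = {
    "depth": rng.standard_normal((16, 8)),
    "segmentation": rng.standard_normal((16, 8)),
}

# Small trainable adapter that fuses the concatenated expert features.
adapter = rng.standard_normal((8 * len(frozen_experts), 4)) * 0.01


def forward(x):
    # Each frozen expert encodes the input; features are concatenated,
    # then passed through the lightweight trainable adapter.
    feats = np.concatenate([x @ w for w in frozen_experts.values()], axis=-1)
    return feats @ adapter


def train_step(x, target, lr=0.001):
    # Gradient descent on the adapter ONLY: the expert weights receive
    # no gradient and stay frozen, mirroring the parameter-efficient setup.
    global adapter
    feats = np.concatenate([x @ w for w in frozen_experts.values()], axis=-1)
    grad = feats.T @ (feats @ adapter - target) / len(x)
    adapter = adapter - lr * grad


x = rng.standard_normal((4, 16))
target = rng.standard_normal((4, 4))
before = np.mean((forward(x) - target) ** 2)
train_step(x, target)
after = np.mean((forward(x) - target) ** 2)
print(after < before)  # the adapter learns while experts stay untouched
```

The design point this illustrates is the one the abstract makes: only the adapter's parameters (64 values here) are trainable, while the bulk of the weights (the experts) are reused as-is, which is what makes the approach data- and parameter-efficient.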
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Captioning | MS COCO Karpathy (test) | CIDEr | 1.365 | 682 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 78.4 | 664 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 78.5 | 466 |
| Image Captioning | COCO (Karpathy split) | CIDEr | 136.5 | 74 |
| Image Captioning | NoCaps (test) | CIDEr (overall) | 110.8 | 61 |
| Image Captioning | NoCaps 1.0 (val) | Overall Score | 112.9 | 29 |
| Resource Efficiency Analysis | Vision-Language Models General | Pre-training Cost (PFlops Days) | 0.66 | 8 |
| Image Captioning | COCO (val) | CLIPScore | 76.7 | 7 |
| Image Captioning | COCO (test) | CLIPScore | 76.7 | 7 |