Multimodal Few-Shot Learning with Frozen Language Models
About
When trained at sufficient scale, auto-regressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples. Here, we present a simple, yet effective, approach for transferring this few-shot learning ability to a multimodal setting (vision and language). Using aligned image and caption data, we train a vision encoder to represent each image as a sequence of continuous embeddings, such that a pre-trained, frozen language model prompted with this prefix generates the appropriate caption. The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples, represented as a sequence of multiple interleaved image and text embeddings. By measuring a single model on a variety of established and new benchmarks, we demonstrate that it can rapidly learn words for new objects and novel visual categories, do visual question-answering with only a handful of examples, and make use of outside knowledge.
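The core mechanism can be sketched in a few lines: the trainable vision encoder maps an image to a short sequence of continuous vectors in the language model's embedding space, and these are simply prepended to the (frozen) token embeddings of the text. The sketch below is a minimal numpy illustration of that interface, not the paper's implementation; the linear `W_vis` stand-in for the vision encoder, the 32×32 input size, and all dimensions are illustrative assumptions (the paper uses an NF-ResNet backbone and a 7-billion-parameter transformer).

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8    # LM embedding width (illustrative; real model is much larger)
n_prefix = 2   # number of visual "tokens" forming the prefix
vocab = 100

# Hypothetical stand-in for the vision encoder: a single linear map from
# pixels to n_prefix embeddings in the LM input space. Only this part
# would receive gradients during training; the LM stays frozen.
W_vis = rng.normal(size=(3 * 32 * 32, n_prefix * d_model))

def encode_image(img):
    """Map an image to a sequence of n_prefix continuous embeddings."""
    return (img.reshape(-1) @ W_vis).reshape(n_prefix, d_model)

# Frozen LM token-embedding table (never updated during training).
E_tok = rng.normal(size=(vocab, d_model))

def build_prompt(img, caption_ids):
    """Visual prefix followed by caption token embeddings: the sequence
    the frozen LM is conditioned on to generate the caption."""
    prefix = encode_image(img)          # (n_prefix, d_model)
    text = E_tok[caption_ids]           # (len(caption), d_model)
    return np.concatenate([prefix, text], axis=0)

img = rng.normal(size=(3, 32, 32))
seq = build_prompt(img, np.array([5, 17, 42]))
print(seq.shape)  # (n_prefix + 3 caption tokens, d_model) = (5, 8)
```

Because images and text end up in the same embedding space, few-shot prompts are built the same way: several (image prefix, text) pairs are interleaved into one long embedding sequence and fed to the unchanged language model.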
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy: 38.2 | 1165 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy: 48.4 | 664 |
| Visual Question Answering | OK-VQA (test) | Accuracy: 12.6 | 296 |
| Visual Question Answering | GQA (test-dev) | -- | 178 |
| Visual Question Answering | VQA 2.0 (val) | Accuracy (Overall): 38.2 | 143 |
| Visual Question Answering | OKVQA (val) | VQA Score: 12.6 | 101 |
| Visual Question Answering | VQA v2 (val) | Accuracy: 29.6 | 99 |
| Visual Question Answering | OK-VQA (val) | Accuracy: 5.9 | 47 |
| Few-shot Image Classification | miniImageNet Open-Ended 5-Way (test) | Accuracy: 34.7 | 35 |
| Few-shot Image Classification | miniImageNet Open-Ended 2-Way | Accuracy: 66 | 35 |