Grounding Language Models to Images for Multimodal Inputs and Outputs
About
We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process arbitrarily interleaved image-and-text data and to generate free-form text interleaved with retrieved images. Our method leverages the capabilities that language models acquire from large-scale text-only pretraining, such as in-context learning and free-form text generation. We keep the language model frozen and finetune only input and output linear layers to enable cross-modality interactions. The resulting model achieves strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcases compelling interactive abilities. Our approach works with any off-the-shelf language model and paves the way toward an effective, general solution for leveraging pretrained language models in visually grounded settings.
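The core idea, a frozen language model with small trainable linear maps on its input and output sides, can be sketched as follows. This is a minimal NumPy illustration under assumed dimensions, with random stand-ins for the frozen vision encoder and language model; all names and sizes here are hypothetical, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
D_VISUAL, D_TEXT = 512, 768  # hypothetical embedding sizes

def visual_encoder(image):
    """Frozen vision encoder stand-in: returns a D_VISUAL feature vector."""
    return rng.standard_normal(D_VISUAL)

# The only trainable parameters: two linear layers bridging modalities.
W_in = rng.standard_normal((D_VISUAL, D_TEXT)) * 0.01   # image -> LM input space
W_out = rng.standard_normal((D_TEXT, D_VISUAL)) * 0.01  # LM hidden -> retrieval space

def image_to_token(image):
    """Project a visual embedding into the LM's input embedding space,
    so an image can be inserted anywhere in a token sequence."""
    return visual_encoder(image) @ W_in      # shape (D_TEXT,)

def hidden_to_retrieval(hidden_state):
    """Project an LM hidden state into the visual space used to score
    candidate images for retrieval."""
    return hidden_state @ W_out              # shape (D_VISUAL,)

# Interleaved input: projected image embedding followed by text token embeddings.
text_embeds = rng.standard_normal((5, D_TEXT))   # 5 text tokens (stand-in)
img_embed = image_to_token(None)
sequence = np.vstack([img_embed, text_embeds])
print(sequence.shape)  # (6, 768)
```

Because the language model itself never changes, only `W_in` and `W_out` need gradients during finetuning, which is what makes the method cheap relative to full multimodal pretraining.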
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image-to-Text Retrieval | MSCOCO | R@1 | 26.8 | 124 |
| Text-to-Image Retrieval | MSCOCO | R@1 | 23.4 | 118 |
| Visual Question Answering | VQA v2 (val) | Accuracy | 28.51 | 99 |
| Visual Dialog | VisDial 1.0 (val) | MRR | 0.22 | 65 |
| Image Captioning | COCO 2017 (val) | -- | -- | 24 |
| Retrieval | VisDial (test) | Avg R | 40.57 | 12 |
| Contextual Image Retrieval | VIST | R@1 | 18.2 | 10 |
| Interleave Retrieval | COCO-Entity | IR@5 | 24.1 | 4 |
| Interleave Retrieval | COCO-Paragraph | IR@5 | 0.26 | 4 |
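Most of the retrieval results above are Recall@k scores: the fraction of queries whose ground-truth item appears among the top-k ranked candidates. A minimal sketch of that metric, assuming each query `i`'s correct candidate is at index `i` of a similarity matrix (a common evaluation convention, not code from this project):

```python
import numpy as np

def recall_at_k(similarity, k):
    """Fraction of queries whose correct item (index i for query i)
    appears among the k highest-scoring candidates."""
    topk = np.argsort(-similarity, axis=1)[:, :k]   # indices of top-k per query
    hits = [i in topk[i] for i in range(similarity.shape[0])]
    return float(np.mean(hits))

# Toy query-vs-candidate similarity scores (3 queries, 3 candidates).
sim = np.array([
    [0.9, 0.1, 0.2],   # query 0: correct item ranked 1st
    [0.3, 0.5, 0.8],   # query 1: correct item ranked 2nd
    [0.1, 0.9, 0.5],   # query 2: correct item ranked 2nd
])
print(recall_at_k(sim, 1))  # 0.333...  (only query 0 hits at k=1)
print(recall_at_k(sim, 2))  # 1.0       (all three hit at k=2)
```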