Grounding Language Models to Images for Multimodal Inputs and Outputs
About
We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process arbitrarily interleaved image-and-text data and to generate free-form text interleaved with retrieved images. Our method leverages the capabilities that language models acquire from large-scale text-only pretraining, such as in-context learning and free-form text generation. We keep the language model frozen and finetune only input and output linear layers to enable cross-modality interactions. The resulting model achieves strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcases compelling interactive abilities. Our approach works with any off-the-shelf language model and paves the way toward an effective, general solution for leveraging pretrained language models in visually grounded settings.
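The core idea, a frozen language model with small trainable linear maps on its input and output sides, can be sketched as follows. This is a minimal NumPy illustration under assumed dimensions, with random stand-ins for the frozen vision encoder and language model; all names and sizes here are hypothetical, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
D_VISUAL, D_TEXT = 512, 768  # hypothetical embedding sizes

def visual_encoder(image):
    """Frozen vision encoder stand-in: returns a D_VISUAL feature vector."""
    return rng.standard_normal(D_VISUAL)

# The only trainable parameters: two linear layers bridging modalities.
W_in = rng.standard_normal((D_VISUAL, D_TEXT)) * 0.01   # image -> LM input space
W_out = rng.standard_normal((D_TEXT, D_VISUAL)) * 0.01  # LM hidden -> retrieval space

def image_to_token(image):
    """Project a visual embedding into the LM's input embedding space,
    so an image can be inserted anywhere in a token sequence."""
    return visual_encoder(image) @ W_in      # shape (D_TEXT,)

def hidden_to_retrieval(hidden_state):
    """Project an LM hidden state into the visual space used to score
    candidate images for retrieval."""
    return hidden_state @ W_out              # shape (D_VISUAL,)

# Interleaved input: projected image embedding followed by text token embeddings.
text_embeds = rng.standard_normal((5, D_TEXT))   # 5 text tokens (stand-in)
img_embed = image_to_token(None)
sequence = np.vstack([img_embed, text_embeds])
print(sequence.shape)  # (6, 768)
```

Because the language model itself never changes, only `W_in` and `W_out` need gradients during finetuning, which is what makes the method cheap relative to full multimodal pretraining.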
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image-to-Text Retrieval | MSCOCO | R@1 | 26.8 | 124 |
| Text-to-Image Retrieval | MSCOCO | R@1 | 23.4 | 118 |
| Visual Question Answering | VQA v2 (val) | Accuracy | 28.51 | 99 |
| Visual Dialog | VisDial 1.0 (val) | MRR | 0.22 | 65 |
| Image Captioning | COCO 2017 (val) | -- | -- | 24 |
| Retrieval | VisDial (test) | Avg R | 40.57 | 12 |
| Contextual Image Retrieval | VIST | R@1 | 18.2 | 10 |
| Interleave Retrieval | COCO-Entity | IR@5 | 24.1 | 4 |
| Interleave Retrieval | COCO-Paragraph | IR@5 | 0.26 | 4 |
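Most of the retrieval results above are Recall@k scores: the fraction of queries whose ground-truth item appears among the top-k ranked candidates. A minimal sketch of that metric, assuming each query `i`'s correct candidate is at index `i` of a similarity matrix (a common evaluation convention, not code from this project):

```python
import numpy as np

def recall_at_k(similarity, k):
    """Fraction of queries whose correct item (index i for query i)
    appears among the k highest-scoring candidates."""
    topk = np.argsort(-similarity, axis=1)[:, :k]   # indices of top-k per query
    hits = [i in topk[i] for i in range(similarity.shape[0])]
    return float(np.mean(hits))

# Toy query-vs-candidate similarity scores (3 queries, 3 candidates).
sim = np.array([
    [0.9, 0.1, 0.2],   # query 0: correct item ranked 1st
    [0.3, 0.5, 0.8],   # query 1: correct item ranked 2nd
    [0.1, 0.9, 0.5],   # query 2: correct item ranked 2nd
])
print(recall_at_k(sim, 1))  # 0.333...  (only query 0 hits at k=1)
print(recall_at_k(sim, 2))  # 1.0       (all three hit at k=2)
```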