VLIS: Unimodal Language Models Guide Multimodal Language Generation
About
Multimodal language generation, which leverages the synergy of language and vision, is a rapidly expanding field. However, existing vision-language models face challenges in tasks that require complex linguistic understanding. To address this issue, we introduce Visual-Language models as Importance Sampling weights (VLIS), a novel framework that combines the visual conditioning capability of vision-language models with the language understanding of unimodal text-only language models, without further training. VLIS extracts the pointwise mutual information of the image and each text token from a vision-language model and uses that value as an importance sampling weight to adjust the token likelihood from a text-only model. VLIS improves vision-language models on diverse tasks, including commonsense understanding (WHOOPS, OK-VQA, and ScienceQA) and complex text generation (Concadia, Image Paragraph Captioning, and ROCStories). Our results suggest that VLIS represents a promising new direction for multimodal language generation.
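The scoring rule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `vlis_scores` and the weight `alpha` are hypothetical, and the VLM's marginal token likelihood is assumed to come from running the same model without the image.

```python
import numpy as np

def vlis_scores(logp_text, logp_vlm_cond, logp_vlm_marg, alpha=1.0):
    """Sketch of VLIS next-token scoring (hypothetical helper).

    logp_text:     log-probs over the vocabulary from a text-only LM.
    logp_vlm_cond: log-probs from a vision-language model given the image.
    logp_vlm_marg: log-probs from the same VLM without the image,
                   used as an estimate of the image-marginal likelihood.
    alpha:         assumed weight on the PMI term (not from the source).
    """
    # Pointwise mutual information of each candidate token with the image:
    # PMI(token; image) = log p_vlm(token | image) - log p_vlm(token)
    pmi = logp_vlm_cond - logp_vlm_marg
    # Treat exp(PMI) as an importance-sampling weight on the text-only
    # likelihood, i.e. add the PMI to the text-only log-likelihood.
    scores = logp_text + alpha * pmi
    # Renormalize into a valid log-distribution over the vocabulary.
    return scores - np.log(np.sum(np.exp(scores)))
```

Tokens that the VLM finds more likely with the image than without (positive PMI) are boosted, while fluency is still governed by the text-only model's likelihood.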
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Science Question Answering | ScienceQA (test) | Average Accuracy | 50.2 | 208 |
| Visual Question Answering | VQA v2 (val) | Accuracy | 53.6 | 99 |
| Visual Question Answering | OK-VQA (val) | Accuracy | 34.2 | 47 |
| Paragraph Captioning | Krause 2017 (test) | METEOR | 14.6 | 10 |
| Identification of weird images | WHOOPS | Accuracy | 80 | 9 |
| Contextual Image Captioning | Concadia (test) | CIDEr | 44.1 | 8 |
| Image Description Generation | Concadia (test) | CIDEr | 28.3 | 7 |
| Story Generation | ROCStories 2016 | Repetition Score (rep-2) | 2.31 | 5 |