VLIS: Unimodal Language Models Guide Multimodal Language Generation
About
Multimodal language generation, which leverages the synergy of language and vision, is a rapidly expanding field. However, existing vision-language models face challenges in tasks that require complex linguistic understanding. To address this issue, we introduce Visual-Language models as Importance Sampling weights (VLIS), a novel framework that combines the visual conditioning capability of vision-language models with the language understanding of unimodal text-only language models, without further training. VLIS extracts the pointwise mutual information of the image and each text token from a vision-language model and uses that value as an importance sampling weight to adjust the token likelihood from a text-only model. VLIS improves vision-language models on diverse tasks, including commonsense understanding (WHOOPS, OK-VQA, and ScienceQA) and complex text generation (Concadia, Image Paragraph Captioning, and ROCStories). Our results suggest that VLIS represents a promising new direction for multimodal language generation.
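The scoring rule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `vlis_scores` and the weight `alpha` are hypothetical, and the VLM's marginal token likelihood is assumed to come from running the same model without the image.

```python
import numpy as np

def vlis_scores(logp_text, logp_vlm_cond, logp_vlm_marg, alpha=1.0):
    """Sketch of VLIS next-token scoring (hypothetical helper).

    logp_text:     log-probs over the vocabulary from a text-only LM.
    logp_vlm_cond: log-probs from a vision-language model given the image.
    logp_vlm_marg: log-probs from the same VLM without the image,
                   used as an estimate of the image-marginal likelihood.
    alpha:         assumed weight on the PMI term (not from the source).
    """
    # Pointwise mutual information of each candidate token with the image:
    # PMI(token; image) = log p_vlm(token | image) - log p_vlm(token)
    pmi = logp_vlm_cond - logp_vlm_marg
    # Treat exp(PMI) as an importance-sampling weight on the text-only
    # likelihood, i.e. add the PMI to the text-only log-likelihood.
    scores = logp_text + alpha * pmi
    # Renormalize into a valid log-distribution over the vocabulary.
    return scores - np.log(np.sum(np.exp(scores)))
```

Tokens that the VLM finds more likely with the image than without (positive PMI) are boosted, while fluency is still governed by the text-only model's likelihood.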
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Science Question Answering | ScienceQA (test) | Average Accuracy | 50.2 | 208 |
| Visual Question Answering | VQA v2 (val) | Accuracy | 53.6 | 99 |
| Visual Question Answering | OK-VQA (val) | Accuracy | 34.2 | 47 |
| Paragraph Captioning | Krause 2017 (test) | METEOR | 14.6 | 10 |
| Identification of weird images | WHOOPS | Accuracy | 80 | 9 |
| Contextual Image Captioning | Concadia (test) | CIDEr | 44.1 | 8 |
| Image Description Generation | Concadia (test) | CIDEr | 28.3 | 7 |
| Story Generation | ROCStories 2016 | Repetition Score (rep-2) | 2.31 | 5 |