
VLIS: Unimodal Language Models Guide Multimodal Language Generation

About

Multimodal language generation, which leverages the synergy of language and vision, is a rapidly expanding field. However, existing vision-language models face challenges in tasks that require complex linguistic understanding. To address this issue, we introduce Visual-Language models as Importance Sampling weights (VLIS), a novel framework that combines the visual conditioning capability of vision-language models with the language understanding of unimodal text-only language models without further training. It extracts pointwise mutual information of each image and text from a visual-language model and uses the value as an importance sampling weight to adjust the token likelihood from a text-only model. VLIS improves vision-language models on diverse tasks, including commonsense understanding (WHOOPS, OK-VQA, and ScienceQA) and complex text generation (Concadia, Image Paragraph Captioning, and ROCStories). Our results suggest that VLIS represents a promising new direction for multimodal language generation.
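The importance-sampling idea above can be sketched in a few lines: for each candidate token, the pointwise mutual information (PMI) between the image and the token is estimated as the gap between the VLM's image-conditioned and image-free log-likelihoods, and that PMI reweights the text-only model's token likelihood. The function and argument names below (and the `alpha` weight) are illustrative assumptions, not the authors' actual API.

```python
import math

def vlis_scores(logp_text, logp_vlm_cond, logp_vlm_uncond, alpha=1.0):
    """Hypothetical sketch of VLIS-style decoding scores.

    For each token t:
        PMI(image, t) ≈ log p_vlm(t | image, ctx) - log p_vlm(t | ctx)
        score(t)      = log p_text(t | ctx) + alpha * PMI(image, t)
    so the text-only model supplies linguistic fluency while the
    vision-language model supplies visual grounding.
    """
    return {
        tok: logp_text[tok] + alpha * (logp_vlm_cond[tok] - logp_vlm_uncond[tok])
        for tok in logp_text
    }

# Toy next-token log-probabilities over three candidates (made-up numbers).
logp_text = {"cat": math.log(0.5), "dog": math.log(0.3), "car": math.log(0.2)}
logp_vlm_cond = {"cat": math.log(0.7), "dog": math.log(0.2), "car": math.log(0.1)}   # image-conditioned
logp_vlm_uncond = {"cat": math.log(0.3), "dog": math.log(0.4), "car": math.log(0.3)} # image marginalized out

scores = vlis_scores(logp_text, logp_vlm_cond, logp_vlm_uncond)
best = max(scores, key=scores.get)  # "cat": favored by both fluency and the image
```

Because only token-level log-probabilities are combined, this reweighting needs no additional training of either model, matching the training-free claim in the abstract.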

Jiwan Chung, Youngjae Yu • 2023

Related benchmarks

Task | Dataset | Metric | Result | Rank
Science Question Answering | ScienceQA (test) | Average Accuracy | 50.2 | 208
Visual Question Answering | VQA v2 (val) | Accuracy | 53.6 | 99
Visual Question Answering | OK-VQA (val) | Accuracy | 34.2 | 47
Paragraph Captioning | Krause 2017 (test) | METEOR | 14.6 | 10
Identification of weird images | WHOOPS | Accuracy | 80 | 9
Contextual Image Captioning | Concadia (test) | CIDEr | 44.1 | 8
Image Description Generation | Concadia (test) | CIDEr | 28.3 | 7
Story Generation | ROCStories 2016 | Repetition Score (rep-2) | 2.31 | 5

Other info

Code
