Language Is Not All You Need: Aligning Perception with Language Models

About

A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning. Experimental results show that Kosmos-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions). We also show that MLLMs can benefit from cross-modal transfer, i.e., transfer knowledge from language to multimodal, and from multimodal to language. In addition, we introduce a dataset of Raven IQ test, which diagnoses the nonverbal reasoning capability of MLLMs.

Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, Furu Wei · 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 50 | 1891 |
| Visual Question Answering | VizWiz | Accuracy | 39 | 1525 |
| Visual Question Answering | VQA v2 | Accuracy | 51.8 | 1362 |
| Commonsense Reasoning | WinoGrande | Accuracy | 54.8 | 1085 |
| Commonsense Reasoning | PIQA | Accuracy | 72.9 | 751 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 46.7 | 706 |
| Image Captioning | MS COCO Karpathy (test) | CIDEr | 0.847 | 682 |
| Reading Comprehension | BoolQ | Accuracy | 56.4 | 279 |
| Visual Question Answering | VQA v2 (test) | Accuracy | 51.8 | 142 |
| Image Captioning | COCO | CIDEr | 101.7 | 130 |
Showing 10 of 24 rows
