CogVLM: Visual Expert for Pretrained Language Models

About

We introduce CogVLM, a powerful open-source visual language foundation model. Unlike the popular shallow alignment method, which maps image features into the input space of the language model, CogVLM bridges the gap between the frozen pretrained language model and the image encoder with a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision-language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks second on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B. Code and checkpoints are available at https://github.com/THUDM/CogVLM.
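The abstract only says that the visual expert sits in the attention and FFN layers. As a rough illustration of what that could look like in attention, the sketch below routes each token through one of two sets of Q/K/V weights depending on its modality, then runs ordinary attention over the mixed sequence. Everything here is an assumption for illustration and is not taken from the released code: the function names, the per-token `is_image` flag, single-head attention, and the causal mask are all hypothetical choices.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def route_projection(tokens, is_image, W_text, W_image):
    """Project each token with the weights that match its modality."""
    return [matvec(W_image if img else W_text, t)
            for t, img in zip(tokens, is_image)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_with_visual_expert(tokens, is_image, text_w, image_w):
    """Single-head causal attention with modality-specific Q/K/V weights.

    text_w stands in for the frozen language-model weights and image_w
    for the trainable visual-expert weights (hypothetical names). Both
    are dicts with keys "q", "k", "v".
    """
    q = route_projection(tokens, is_image, text_w["q"], image_w["q"])
    k = route_projection(tokens, is_image, text_w["k"], image_w["k"])
    v = route_projection(tokens, is_image, text_w["v"], image_w["v"])
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # Causal mask (an assumption of this sketch): token i sees 0..i.
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)
        out.append([sum(w * v[j][c] for j, w in enumerate(weights))
                    for c in range(d)])
    return out

# Toy example: one image token followed by one text token.
identity = [[1.0, 0.0], [0.0, 1.0]]
doubled = [[2.0, 0.0], [0.0, 2.0]]
text_w = {"q": identity, "k": identity, "v": identity}
image_w = {"q": doubled, "k": doubled, "v": doubled}
tokens = [[1.0, 0.0], [0.0, 1.0]]
is_image = [True, False]
out = attention_with_visual_expert(tokens, is_image, text_w, image_w)
```

In this toy setup the image token is projected with `doubled` and the text token with `identity`, yet both end up in the same key/value sequence, so the text token attends over image-derived values. Under the same assumptions, an FFN could be routed the same way by selecting between two sets of FFN weights per token.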

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, Jie Tang • 2023

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 88	2019
Visual Question Answering	TextVQA	Accuracy 70.4	1453
Visual Question Answering	VQA v2	Accuracy 84.7	1429
Visual Question Answering	GQA	Accuracy 59.43	1425
Visual Question Answering	VQA v2 (test-dev)	Overall Accuracy 82.3	712
Image Captioning	MS COCO Karpathy (test)	CIDEr 1.487	706
Multimodal Understanding	MM-Vet	MM-Vet Score 54.5	631
Multimodal Reasoning	MM-Vet	MM-Vet Score 52.8	517
Multimodal Understanding	SEED-Bench	Accuracy 68.8	516
Mathematical Reasoning	MathVista	Score 38.6	474

Showing 10 of 164 rows
