CogVLM2: Visual Language Models for Image and Video Understanding

About

Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding, including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to $1344 \times 1344$ pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, the CogVLM2 family has achieved state-of-the-art results on benchmarks such as MMBench, MM-Vet, TextVQA, MVBench and VCGBench. All models are open-sourced at https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.

Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, Lei Zhao, Zhuoyi Yang, Xiaotao Gu, Xiaohan Zhang, Guanyu Feng, Da Yin, Zihan Wang, Ji Qi, Xixuan Song, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Yuxiao Dong, Jie Tang • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Multimodal Evaluation | MME | Score: 1.87e+3 | 557 |
| Multi-discipline Multimodal Understanding | MMMU | Accuracy: 44.3 | 266 |
| Visual Question Answering | ChartQA | Accuracy: 81 | 239 |
| Diagram Question Answering | AI2D | AI2D Accuracy: 73.4 | 196 |
| Multimodal Understanding | MMBench (MMB) | Accuracy: 80.5 | 69 |
| Construction Year Estimation | YearGuessr 1.0 (test) | MAE: 41.5 | 32 |
| Counter-Perception Discrimination | CP-Bench (dev) | F1 Score: 42.8 | 25 |
| Counter-Perception Discrimination | CP-Bench (test) | F1 Score: 34.2 | 25 |
| Visual Reasoning and Instruction Following | MM-Vet | Overall Score: 60.4 | 23 |
| Date Estimation | YearGuessr (test) | MAE: 42.5 | 23 |

Showing 10 of 17 rows
