CogVLM2: Visual Language Models for Image and Video Understanding
About
Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding, including CogVLM2, CogVLM2-Video, and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both the pre-training and post-training stages, supporting input resolutions up to $1344 \times 1344$ pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, the CogVLM2 family has achieved state-of-the-art results on benchmarks such as MMBench, MM-Vet, TextVQA, MVBench, and VCGBench. All models are open-sourced at https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.
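The "multi-frame input with timestamps" idea for video understanding can be sketched as follows. This is a minimal illustration, not the released CogVLM2-Video API: the function names, the frame count, and the prompt format are assumptions made for clarity — the core idea from the abstract is simply that sampled frames are paired with their timestamps so the model can reason about when events occur.

```python
def sample_timestamps(duration_s: float, num_frames: int = 24) -> list[float]:
    """Evenly sample `num_frames` timestamps (in seconds) across a video.

    The count of 24 frames is an assumption for this sketch, not a
    confirmed detail of the released model.
    """
    step = duration_s / num_frames
    return [round(i * step, 2) for i in range(num_frames)]


def build_timed_prompt(timestamps: list[float], question: str) -> str:
    """Interleave per-frame timestamp markers with a user question.

    Each sampled frame would be fed to the vision encoder alongside its
    textual timestamp marker; here we only show the text side.
    """
    frame_lines = [f"<frame {i}> time: {t}s" for i, t in enumerate(timestamps)]
    return "\n".join(frame_lines) + "\nQuestion: " + question


# Example: a 60-second clip sampled into 24 timestamped frame slots.
ts = sample_timestamps(60.0)
prompt = build_timed_prompt(ts, "When does the person pick up the cup?")
```

Pairing each frame with an explicit timestamp (rather than relying on frame order alone) is what enables temporal grounding answers such as "at 12.5s", which in turn makes the automated temporal grounding data construction mentioned above possible.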
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multimodal Evaluation | MME | Score: 1870 | 557 |
| Multi-discipline Multimodal Understanding | MMMU | Accuracy: 44.3 | 266 |
| Visual Question Answering | ChartQA | Accuracy: 81.0 | 239 |
| Diagram Question Answering | AI2D | Accuracy: 73.4 | 196 |
| Multimodal Understanding | MMBench (MMB) | Accuracy: 80.5 | 69 |
| Construction Year Estimation | YearGuessr 1.0 (test) | MAE: 41.5 | 32 |
| Counter-Perception Discrimination | CP-Bench (dev) | F1 Score: 42.8 | 25 |
| Counter-Perception Discrimination | CP-Bench (test) | F1 Score: 34.2 | 25 |
| Visual Reasoning and Instruction Following | MM-Vet | Overall Score: 60.4 | 23 |
| Date Estimation | YearGuessr (test) | MAE: 42.5 | 23 |