CogVLM2: Visual Language Models for Image and Video Understanding
About
Beginning with VisualGLM and CogVLM, we have continuously explored VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architectures, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding, including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both the pre-training and post-training stages, supporting input resolutions up to $1344 \times 1344$ pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, the CogVLM2 family has achieved state-of-the-art results on benchmarks such as MMBench, MM-Vet, TextVQA, MVBench and VCGBench. All models are open-sourced at https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.
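Two concrete details above can be sketched in a few lines: the patch budget implied by the $1344 \times 1344$ input resolution, and timestamp-annotated multi-frame sampling for video. This is a minimal illustration, not the model's actual preprocessing; the patch size of 14 and the frame count of 24 are assumed example values, not taken from the source.

```python
# Illustrative sketch only. Assumptions (not from the source):
# a ViT patch size of 14 and 24 sampled frames are example values.

def patch_grid(resolution: int = 1344, patch_size: int = 14) -> tuple[int, int]:
    """Patches per side and total patch count for a square input image."""
    per_side = resolution // patch_size
    return per_side, per_side * per_side

def sample_frames_with_timestamps(duration_s: float, num_frames: int = 24) -> list[float]:
    """Uniformly sample frame timestamps (in seconds) across a video,
    mirroring the multi-frame-with-timestamps input described above."""
    step = duration_s / num_frames
    return [round(i * step + step / 2, 2) for i in range(num_frames)]

per_side, total = patch_grid()
print(per_side, total)                          # 96 patches per side, 9216 total
print(sample_frames_with_timestamps(60.0, 4))   # [7.5, 22.5, 37.5, 52.5]
```

At a hypothetical patch size of 14, a full-resolution input yields a 96×96 grid of 9216 patches, which is why higher-resolution support requires an efficient architecture.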
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Evaluation | MME | Score | 1870 | 658 |
| Human-Object Interaction Detection | HICO-DET (test) | mAP (full) | 2.08 | 544 |
| Visual Question Answering | ChartQA | Accuracy | 81 | 371 |
| Multi-discipline Multimodal Understanding | MMMU | Accuracy | 44.3 | 317 |
| Diagram Question Answering | AI2D | Accuracy | 73.4 | 232 |
| Diagram Question Answering | AI2D (test) | Accuracy | 73.4 | 142 |
| Multimodal Understanding | MMBench (MMB) | Accuracy | 80.5 | 141 |
| Goal reasoning | EgoToM | Accuracy | 73 | 38 |
| Construction Year Estimation | YearGuessr 1.0 (test) | MAE | 41.5 | 32 |
| Human-object interaction | VG-HOI Annotated-box | mAP | 2.28 | 29 |