
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

About

We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with different aspect ratios. For the language component, we leverage DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses the Key-Value cache into latent vectors, to enable efficient inference and high throughput. Trained on an improved vision-language dataset, DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters, respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. Code and pre-trained models are publicly accessible at https://github.com/deepseek-ai/DeepSeek-VL2.
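
The abstract does not spell out how dynamic tiling selects a layout, so the following is only a minimal sketch of one plausible approach: pick the tile grid whose canvas leaves the least padding once the image is scaled to fit with its aspect ratio preserved. The function name `choose_tile_grid`, the tile size of 384 pixels, and the cap of 9 tiles are assumptions made for illustration and are not taken from the text above.

```python
from itertools import product


def choose_tile_grid(width, height, tile=384, max_tiles=9):
    """Return the (cols, rows) grid that wastes the least canvas area as padding.

    Hypothetical helper: tile size and tile budget are illustrative assumptions.
    """
    best = None
    for cols, rows in product(range(1, max_tiles + 1), repeat=2):
        if cols * rows > max_tiles:
            continue  # stay within the tile budget
        canvas_w, canvas_h = cols * tile, rows * tile
        # Scale the image to fit inside the canvas, keeping its aspect ratio.
        scale = min(canvas_w / width, canvas_h / height)
        used_area = (width * scale) * (height * scale)
        padding = canvas_w * canvas_h - used_area
        # Prefer less padding; break ties with fewer tiles.
        key = (padding, cols * rows)
        if best is None or key < best[0]:
            best = (key, (cols, rows))
    return best[1]


if __name__ == "__main__":
    print(choose_tile_grid(1920, 640))  # wide 3:1 image
    print(choose_tile_grid(640, 1920))  # tall 1:3 image
```

Under these assumptions, a wide image is mapped to a wide grid and a tall image to a tall grid, so each tile passed to the vision encoder keeps roughly the original aspect ratio instead of being squashed into a single square.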

Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, Chong Ruan • 2024

Related benchmarks

Task | Dataset | Result | Rank
Mathematical Reasoning | MathVista | Score 62.8 | 322
OCR Evaluation | OCRBench | Score 834 | 296
Multi-discipline Multimodal Understanding | MMMU | -- | 266
Hallucination Evaluation | POPE | Accuracy 89.4 | 132
Chart Question Answering | ChartQA (test) | -- | 129
Visual Understanding | MM-Vet | MM-Vet Score 60 | 102
Hallucination Evaluation | HallusionBench | -- | 93
Visual Question Answering | OCR-VQA (test) | Accuracy 54.5 | 77
Document Visual Question Answering | DocVQA (val) | Accuracy 92.3 | 66
Visual Question Answering | SEED-Bench Image | Accuracy 76.8 | 64
Showing 10 of 103 rows
