
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

About

In this paper, we focus on monolithic Multimodal Large Language Models (MLLMs) that integrate visual encoding and language decoding into a single LLM. In particular, we identify that existing pre-training strategies for monolithic MLLMs often suffer from unstable optimization or catastrophic forgetting. To address this issue, our core idea is to embed a new visual parameter space into a pre-trained LLM, thereby stably learning visual knowledge from noisy data while keeping the LLM frozen. Based on this principle, we present Mono-InternVL, a novel monolithic MLLM that seamlessly integrates a set of visual experts via a multimodal mixture-of-experts structure. Moreover, we propose an innovative pre-training strategy to maximize the visual capability of Mono-InternVL, namely Endogenous Visual Pre-training (EViP). EViP is designed as a progressive learning process for the visual experts, which aims to fully exploit visual knowledge by progressing from noisy data to high-quality data. To validate our approach, we conduct extensive experiments on 16 benchmarks. Experimental results confirm that Mono-InternVL outperforms existing monolithic MLLMs on 13 of 16 multimodal benchmarks, e.g., +80 points over Emu3 on OCRBench. Compared to the modular baseline, i.e., InternVL-1.5, Mono-InternVL retains comparable multimodal performance while reducing first-token latency by up to 67%. Code and model are released at https://github.com/OpenGVLab/Mono-InternVL.
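
The idea of adding visual experts alongside a frozen LLM can be illustrated with a toy sketch. This is not the authors' implementation: it assumes that each token is routed by its modality (a detail the abstract does not state), and the `Expert` class, `route_by_modality`, and `trainable_experts` are hypothetical names that stand in for real feed-forward layers.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

Vector = List[float]


@dataclass
class Expert:
    """Toy stand-in for a feed-forward expert: elementwise scale and shift."""
    scale: float
    shift: float
    trainable: bool  # False models a frozen, pre-trained component

    def __call__(self, x: Vector) -> Vector:
        return [self.scale * v + self.shift for v in x]


def route_by_modality(
    tokens: Sequence[Vector],
    is_visual: Sequence[bool],
    text_expert: Expert,
    visual_expert: Expert,
) -> List[Vector]:
    """Send each token to the expert matching its modality (assumed routing rule)."""
    if len(tokens) != len(is_visual):
        raise ValueError("tokens and is_visual must have the same length")
    return [
        (visual_expert if vis else text_expert)(tok)
        for tok, vis in zip(tokens, is_visual)
    ]


def trainable_experts(experts: Dict[str, Expert]) -> List[str]:
    """Names of experts that would receive gradient updates."""
    return [name for name, expert in experts.items() if expert.trainable]


# The text expert stands in for the frozen pre-trained LLM weights;
# the visual expert is the newly added, trainable parameter space.
text_expert = Expert(scale=1.0, shift=0.0, trainable=False)
visual_expert = Expert(scale=2.0, shift=1.0, trainable=True)

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
is_visual = [True, False, True]

outputs = route_by_modality(tokens, is_visual, text_expert, visual_expert)
updated = trainable_experts({"text": text_expert, "visual": visual_expert})
```

In this sketch only the visual expert is marked trainable, which mirrors the stated goal of learning visual knowledge without disturbing the pre-trained language parameters.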

Gen Luo, Xue Yang, Wenhan Dou, Zhaokai Wang, Jiawen Liu, Jifeng Dai, Yu Qiao, Xizhou Zhu • 2024

Related benchmarks

Task                                       Dataset   Result         Rank
Text-based Visual Question Answering       TextVQA   Accuracy 72.6  496
Multi-discipline Multimodal Understanding  MMMU      --             266
Chart Question Answering                   ChartQA   Accuracy 73.7  229
Visual Question Answering                  AI2D      Accuracy 68.6  174
Document Visual Question Answering         DocVQA    ANLS 80        164
Optical Character Recognition Evaluation   OCRBench  Score 76.7     46
Infographic Visual Question Answering      InfoVQA   Accuracy 43    40
Multi-modal Vision-Language Understanding  MMVet     Score 40.1     38
General Vision-Language Understanding      MMB       Score 65.5     25
Humor Detection                            YesBut    Accuracy 48.2  21
Showing 10 of 14 rows
