InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
About
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multimodal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 89.6 | 935 |
| Multimodal Evaluation | MME | Score | 2220 | 557 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 80.2 | 496 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 81.3 | 418 |
| Mathematical Reasoning | MATH500 (test) | -- | -- | 381 |
| Multimodal Understanding | MMBench | -- | -- | 367 |
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy | 88.2 | 345 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy | 92.5 | 335 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy | 94.6 | 333 |
| Mathematical Reasoning | MathVista | Score | 71.6 | 322 |