SAIL-VL2 Technical Report
About
We introduce SAIL-VL2, an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities from fine-grained perception to complex reasoning. Its effectiveness is driven by three core innovations. First, a large-scale data curation pipeline with scoring and filtering strategies enhances both quality and distribution across captioning, OCR, QA, and video data, improving training efficiency. Second, a progressive training framework begins with a powerful pre-trained vision encoder (SAIL-ViT), advances through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities. Third, architectural advances extend beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs. With these contributions, SAIL-VL2 demonstrates competitive performance across 106 datasets and achieves state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista. Furthermore, on the OpenCompass leaderboard, SAIL-VL2-2B ranks first among officially released open-source models under the 4B parameter scale, while serving as an efficient and extensible foundation for the open-source multimodal community.
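As a rough illustration of the scoring-and-filtering idea in the first contribution, the sketch below keeps only samples whose quality score clears a per-category threshold. The `Sample` fields, the scores, the thresholds, and the `filter_by_score` helper are all hypothetical; the abstract does not specify how SAIL-VL2 scores or filters its data.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    text: str
    category: str   # e.g. "caption", "ocr", "qa", "video"
    quality: float  # hypothetical quality score in [0, 1]


def filter_by_score(samples, thresholds, default=0.5):
    """Keep samples whose score meets the threshold for their category.

    Categories missing from `thresholds` fall back to `default`.
    """
    return [s for s in samples if s.quality >= thresholds.get(s.category, default)]


# Toy data; the scores and thresholds are made up for illustration only.
samples = [
    Sample("A dog runs on the beach.", "caption", 0.91),
    Sample("img_0042", "caption", 0.12),
    Sample("Invoice total: 42.00", "ocr", 0.66),
    Sample("???", "ocr", 0.30),
]
kept = filter_by_score(samples, {"caption": 0.5, "ocr": 0.6})
print([s.text for s in kept])
```

Per-category thresholds are one simple way to control the data distribution as well as quality, since each data type can be filtered more or less aggressively.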
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Mathematical Multimodal Reasoning | MathVerse | Accuracy | 43.2 | 221 |
| Multimodal Math Reasoning | MathVision | Accuracy | 27.6 | 183 |
| Multimodal Math Reasoning | WeMath | Accuracy | 35.8 | 168 |
| Multimodal Mathematical Reasoning | OlympiadBench | Accuracy | 14.1 | 56 |
| Multimodal Logical Reasoning | LogicVista | Accuracy | 45 | 47 |
| Visual Discrepancy Detection | OddGridBench | Color Accuracy | 45 | 27 |
| General Multimodal Understanding | General Multimodal Evaluation Suite (MMMU, MMBench, MME, ChartQA, AI2D, HallBench) | MMMU (Val) | 66.1 | 14 |
| Visual Perception and Reasoning | V* Bench 1.0 (test) | Attribute Score | 51.3 | 13 |
| Multimodal Understanding | Opencompass Image Benchmark (val) | MMBench Accuracy | 84 | 12 |
| Universal Retrieval | Office-Home | P@1 | 73.08 | 11 |