Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models
About
Recently, open-source vision-language models (VLMs) have made promising progress in bringing their capabilities closer to those of proprietary frontier models. However, most open-source models publish only their final weights, leaving the critical details of their data strategies and implementation largely opaque. In this work, we address VLM post-training from a data-centric perspective, showing the key role of data strategy in developing frontier VLMs. By studying and building our post-training data strategy from scratch, we share detailed insights into the development process, aiming to benefit the development of competitive models in the open-source community. Our data strategy, together with training recipes and model design, leads to a family of performant VLMs named Eagle2. Specifically, Eagle2-9B achieves state-of-the-art results across various multimodal benchmarks, matching certain competitive models with up to 70B parameters.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Understanding | MMBench | Accuracy | 74.9 | 637 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 53.8 | 531 |
| Visual Question Answering | ChartQA | Accuracy | 82.3 | 371 |
| Multimodal Understanding | MMStar | Accuracy | 56.4 | 324 |
| Visual Question Answering | AI2D | Accuracy | 79.3 | 249 |
| Visual Question Answering | DocVQA | Accuracy | 88.0 | 162 |
| Multimodal Understanding | MMMU (val) | -- | -- | 152 |
| Visual Question Answering | InfoVQA | Accuracy | 65.8 | 135 |
| Multimodal Understanding | MME Perception | -- | -- | 46 |
| Multimodal Reasoning | HallusionBench | Accuracy | 0.458 | 42 |