STEP3-VL-10B Technical Report
About
We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses.

Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10×–20× larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) as well as top-tier proprietary flagships such as Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.
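The PaCoRe idea described above (spend extra test-time compute by exploring several reasoning trajectories in parallel, then synthesizing them into one answer) can be sketched as follows. This is a minimal illustrative sketch only: the `sample_hypothesis` stub and the majority-vote synthesis are assumptions for demonstration, not the report's actual model-based rollout or synthesis procedure.

```python
import random
from collections import Counter

def sample_hypothesis(question: str, seed: int) -> str:
    """Stand-in for one independent reasoning rollout.

    In the real system this would be a full vision-language reasoning
    trace producing a candidate answer; here a seeded RNG picks from a
    toy answer pool (biased toward "A") to mimic stochastic exploration.
    """
    rng = random.Random(seed)
    return rng.choice(["A", "A", "A", "B"])

def pacore_answer(question: str, n_parallel: int = 8):
    # 1) Explore: launch n_parallel independent reasoning trajectories,
    #    each seeded differently so they produce diverse hypotheses.
    hypotheses = [sample_hypothesis(question, seed=i) for i in range(n_parallel)]
    # 2) Synthesize: aggregate the diverse hypotheses into a single
    #    final answer. Majority vote here; the actual synthesis step
    #    would itself be model-based.
    winner, _count = Counter(hypotheses).most_common(1)[0]
    return winner, hypotheses

answer, drafts = pacore_answer("Which region of the image contains text?")
```

Raising `n_parallel` is the test-time-compute knob: more parallel trajectories means broader exploration before synthesis, at proportionally higher inference cost.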
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| OCR Evaluation | OCRBench | Score | 86.75 | 296 |
| Instruction Following | IFEval | -- | -- | 292 |
| GUI Grounding | ScreenSpot v2 | Avg Accuracy | 92.61 | 203 |
| Optical Character Recognition | OCRBench | OCRBench Score | 89 | 83 |
| Multimodal Reasoning | MMMU-Pro | Accuracy | 67.18 | 55 |
| Counting | CountBench | Accuracy | 88.8 | 52 |
| Multimodal Reasoning | MMMU | Accuracy | 80.11 | 44 |
| Visual Question Answering | MMBench CN | Accuracy | 91.96 | 40 |
| Mathematical Reasoning | MathVision | Accuracy | 75.95 | 38 |
| Visual Question Answering | MMBench English | Accuracy | 92.38 | 36 |