MiMo-V2-Flash Technical Report
About
We present MiMo-V2-Flash, a Mixture-of-Experts (MoE) model with 309B total parameters and 15B active parameters, designed for fast, strong reasoning and agentic capabilities. MiMo-V2-Flash adopts a hybrid attention architecture that interleaves Sliding Window Attention (SWA) with global attention, using a 128-token sliding window at a 5:1 SWA-to-global ratio. The model is pre-trained on 27 trillion tokens with Multi-Token Prediction (MTP) at a native 32k context length, which is subsequently extended to 256k. To scale post-training compute efficiently, MiMo-V2-Flash introduces a novel Multi-Teacher On-Policy Distillation (MOPD) paradigm, in which domain-specialized teachers (e.g., trained via large-scale reinforcement learning) provide dense, token-level rewards, enabling the student model to faithfully absorb teacher expertise. MiMo-V2-Flash rivals top-tier open-weight models such as DeepSeek-V3.2 and Kimi-K2 despite using only 1/2 and 1/3 of their total parameters, respectively. During inference, repurposing the MTP module as a draft model for speculative decoding yields an average acceptance length of up to 3.6 tokens and a 2.6x decoding speedup with three MTP layers. We open-source both the model weights and the three-layer MTP weights to foster open research and community collaboration.
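The hybrid attention layout described above can be illustrated with a minimal sketch. This is not the released implementation: the exact placement of global layers within the 5:1 pattern is an assumption (here, every sixth layer is global), and the mask is a naive boolean construction; only the 128-token window and the 5:1 ratio come from the report.

```python
WINDOW = 128          # sliding-window size from the report
HYBRID_RATIO = 5      # five SWA layers per global-attention layer


def layer_kinds(num_layers: int) -> list[str]:
    """Assign each transformer layer 'swa' or 'global' under the 5:1 ratio.

    Assumed placement: every sixth layer is global; the report does not
    specify where the global layers sit within each group of six.
    """
    return [
        "global" if (i + 1) % (HYBRID_RATIO + 1) == 0 else "swa"
        for i in range(num_layers)
    ]


def causal_window_mask(seq_len: int, window: int = WINDOW) -> list[list[bool]]:
    """Boolean attention mask for one SWA layer.

    Entry [q][k] is True when query position q may attend to key
    position k: causal (k <= q) and within the window (q - k < window).
    """
    return [
        [k <= q and q - k < window for k in range(seq_len)]
        for q in range(seq_len)
    ]


if __name__ == "__main__":
    print(layer_kinds(12))
    # With a window of 2, position 3 attends only to positions 2 and 3.
    print(causal_window_mask(4, window=2)[3])
```

Under this placement, a 12-layer stack gets global attention at layers 6 and 12; a full global layer periodically restores long-range access that the 128-token windows alone would lose.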
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 88.5 | 1460 |
| Commonsense Reasoning | WinoGrande | Accuracy | 83.8 | 776 |
| Question Answering | ARC Challenge | Accuracy | 95.9 | 749 |
| Mathematical Reasoning | MATH | Accuracy | 71.0 | 643 |
| Mathematical Reasoning | GSM8K | Accuracy | 92.3 | 358 |
| Question Answering | TriviaQA | Accuracy | 80.3 | 210 |
| Code Generation | HumanEval+ | Pass@1 | 70.7 | 189 |
| General Knowledge | MMLU | Accuracy | 86.7 | 170 |
| Code Generation | MBPP+ | Pass@1 | 71.4 | 122 |
| Question Answering | SimpleQA | Accuracy | 20.6 | 92 |