MobileLLM-Pro Technical Report
About
Efficient on-device language models with around 1 billion parameters are essential for powering low-latency AI applications on mobile and wearable devices. However, achieving strong performance in this model class while supporting long context windows and practical deployment remains a significant challenge. We introduce MobileLLM-Pro, a 1-billion-parameter language model optimized for on-device deployment. MobileLLM-Pro achieves state-of-the-art results across 11 standard benchmarks, significantly outperforming both Gemma 3-1B and Llama 3.2-1B, while supporting context windows of up to 128,000 tokens and showing only minor performance regressions at 4-bit quantization. These improvements are enabled by four core innovations: (1) implicit positional distillation, a novel technique that effectively instills long-context capabilities through knowledge distillation; (2) a specialist model merging framework that fuses multiple domain experts into a compact model without parameter growth; (3) simulation-driven data mixing using utility estimation; and (4) 4-bit quantization-aware training with self-distillation. We release our model weights and code to support future research in efficient on-device language models.
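To make the "without parameter growth" claim in innovation (2) concrete, here is a minimal sketch of one common way to merge specialists: weighted averaging of parameters across checkpoints that share an architecture. This is an assumption for illustration only; the abstract does not specify the merging rule, and the names `merge_specialists`, `count_params`, and the toy "experts" are hypothetical. Plain Python lists stand in for tensors.

```python
from typing import Dict, List, Sequence

# Toy stand-in for a checkpoint: parameter name -> flat list of values.
StateDict = Dict[str, List[float]]


def merge_specialists(
    specialists: Sequence[StateDict], weights: Sequence[float]
) -> StateDict:
    """Weighted average of same-shaped checkpoints (illustrative assumption,
    not necessarily the merging rule used by MobileLLM-Pro)."""
    if not specialists or len(specialists) != len(weights):
        raise ValueError("need one weight per specialist")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]  # normalize so the weights sum to 1

    merged: StateDict = {}
    for name in specialists[0]:
        tensors = [s[name] for s in specialists]
        size = len(tensors[0])
        if any(len(t) != size for t in tensors):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise weighted sum across all specialists.
        merged[name] = [
            sum(w * t[i] for w, t in zip(norm, tensors)) for i in range(size)
        ]
    return merged


def count_params(state: StateDict) -> int:
    """Total number of scalar parameters in a checkpoint."""
    return sum(len(values) for values in state.values())


# Two hypothetical domain experts with identical architecture.
code_expert = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
math_expert = {"layer.weight": [3.0, 4.0], "layer.bias": [2.0]}

merged = merge_specialists([code_expert, math_expert], [1.0, 1.0])
```

Because the merge happens in parameter space, `count_params(merged)` equals `count_params(code_expert)`: the merged model is the same size as each specialist, unlike an ensemble or mixture-of-experts that would keep every expert's weights.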
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | HellaSwag Accuracy | 66.2 | 711 |
| Mathematical Reasoning | MATH 500 | Top-1 Accuracy | 8.8 | 384 |
| Commonsense Reasoning | PIQA | Accuracy | 76.6 | 213 |
| Commonsense Reasoning | SIQA | Accuracy | 47.4 | 168 |
| Knowledge | MMLU | Accuracy | 30.4 | 161 |
| Reasoning | GSM8K | -- | -- | 111 |
| Instruction Following | IFEval | -- | -- | 89 |
| Science Reasoning | GPQA | Accuracy (GPQA) | 23.2 | 72 |
| Multiple Tasks | Foundational Benchmarks Average | Average Accuracy | 47 | 13 |
| Reading | BoolQ, DROP | BoolQ Accuracy | 68.8 | 13 |