# Phi-4 Technical Report

## About
We present phi-4, a 14-billion parameter language model developed with a training recipe that is centrally focused on data quality. Unlike most language models, where pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process. While previous models in the Phi family largely distill the capabilities of a teacher model (specifically GPT-4), phi-4 substantially surpasses its teacher model on STEM-focused QA capabilities, giving evidence that our data-generation and post-training techniques go beyond distillation. Despite minimal changes to the phi-3 architecture, phi-4 achieves strong performance relative to its size -- especially on reasoning-focused benchmarks -- due to improved data, training curriculum, and innovations in the post-training scheme.
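As a quick orientation, here is a minimal inference sketch using the Hugging Face transformers library. The model id `microsoft/phi-4`, the use of a chat template, and the generation settings are assumptions of this sketch, not details taken from the report.

```python
# Minimal inference sketch; "microsoft/phi-4" is an assumed Hugging Face
# model id, not a checkpoint name confirmed by this report.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-4"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 14B params: bf16 keeps weights around 28 GB
    device_map="auto",
)

# A STEM-style question, in line with the reasoning focus described above.
messages = [{"role": "user", "content": "If 3x + 5 = 20, what is x?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```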
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reasoning | BBH | Accuracy | 87.6 | 507 |
| Mathematical Reasoning | GSM8K | Accuracy | 89.4 | 358 |
| Instruction Following | IFEval | -- | -- | 292 |
| Instruction Following | AlpacaEval 2.0 | LC Win Rate | 48.1 | 281 |
| Multi-hop Question Answering | 2WikiMultihopQA | -- | -- | 278 |
| Multiple-choice Question Answering | MMLU-Pro | Overall Accuracy | 58.22 | 116 |
| Mathematical Reasoning | AIME 2024 (test) | Accuracy | 10 | 103 |
| Natural Language Inference | MedNLI (test) | Accuracy | 64.26 | 89 |
| Code Generation | MBPP Plus (test) | Accuracy | 63.49 | 87 |
| Code Generation | HumanEval+ (test) | -- | -- | 81 |
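To make accuracy entries like the GSM8K row concrete, here is a hedged sketch of exact-match scoring as commonly used for math word-problem benchmarks. The `extract_final_number` helper and its regex are illustrative placeholders, not the evaluation harness behind the numbers above.

```python
# Hedged sketch of exact-match accuracy scoring; the answer-extraction
# heuristic is an illustrative assumption, not the report's actual harness.
import re

def extract_final_number(text: str) -> str | None:
    """Return the last number appearing in the text, if any."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of examples whose extracted answer matches the reference."""
    correct = sum(
        1
        for pred, ref in zip(predictions, references)
        if extract_final_number(pred) == extract_final_number(ref)
    )
    return correct / len(references)

# Example: two model outputs scored against references -> 0.5 accuracy.
preds = ["... so the answer is 42.", "The total cost is $17."]
refs = ["42", "18"]
print(f"Accuracy: {accuracy(preds, refs):.1%}")  # prints "Accuracy: 50.0%"
```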