Phi-4-reasoning Technical Report

About

We introduce Phi-4-reasoning, a 14-billion parameter reasoning model that achieves strong performance on complex reasoning tasks. Trained via supervised fine-tuning of Phi-4 on carefully curated set of "teachable" prompts-selected for the right level of complexity and diversity-and reasoning demonstrations generated using o3-mini, Phi-4-reasoning generates detailed reasoning chains that effectively leverage inference-time compute. We further develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning that offers higher performance by generating longer reasoning traces. Across a wide range of reasoning tasks, both models outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of full DeepSeek-R1 model. Our comprehensive evaluations span benchmarks in math and scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding. Interestingly, we observe a non-trivial transfer of improvements to general-purpose benchmarks as well. In this report, we provide insights into our training data, our training methodologies, and our evaluations. We show that the benefit of careful data curation for supervised fine-tuning (SFT) extends to reasoning language models, and can be further amplified by reinforcement learning (RL). Finally, our evaluation points to opportunities for improving how we assess the performance and robustness of reasoning models.

Marah Abdin, Sahaj Agarwal, Ahmed Awadallah, Vidhisha Balachandran, Harkirat Behl, Lingjiao Chen, Gustavo de Rosa, Suriya Gunasekar, Mojan Javaheripi, Neel Joshi, Piero Kauffmann, Yash Lara, Caio C\'esar Teodoro Mendes, Arindam Mitra, Besmira Nushi, Dimitris Papailiopoulos, Olli Saarikivi, Shital Shah, Vaishnavi Shrivastava, Vibhav Vineet, Yue Wu, Safoora Yousefi, Guoqing Zheng• 2025

Related benchmarks

Task	Dataset	Result
Science Question Answering	ScienceQA	Accuracy99.8	791
Question Answering	ARC Challenge	Accuracy (ARC)72.8	598
Mathematical Reasoning	AIME 2024	Accuracy3.3	479
Mathematical Reasoning	GSM8K	Accuracy100	388
Graduate-level Science Reasoning	GPQA	Accuracy64.8	121
Factuality Evaluation	TruthfulQA	MC243.21	103
Mathematical Reasoning	MATH500	Accuracy94.4	82
Scientific Reasoning	GPQA Diamond	Score72.7	68
Mathematical Reasoning	AIME’25, AIME’24, AMC’23, and MATH500 Average (test)	Acc@472.7	66
Question Answering	GPQA Diamond	Accuracy25.6	61

Showing 10 of 58 rows

Other info

Follow for update

@wizwand_team Discord