Not All Correct Answers Are Equal: Why Your Distillation Source Matters

About

Distillation has emerged as a practical and effective approach to enhance the reasoning capabilities of open-source language models. In this work, we conduct a large-scale empirical study on reasoning data distillation by collecting verified outputs from three state-of-the-art teacher models (AM-Thinking-v1, Qwen3-235B-A22B, and DeepSeek-R1) on a shared corpus of 1.89 million queries. We construct three parallel datasets and analyze their distributions, revealing that AM-Thinking-v1-distilled data exhibits greater token length diversity and lower perplexity. Student models trained on each dataset are evaluated on reasoning benchmarks including AIME2024, AIME2025, MATH500, and LiveCodeBench. The model distilled from AM-Thinking-v1 consistently achieves the best performance (e.g., 84.3 on AIME2024, 72.2 on AIME2025, 98.4 on MATH500, and 65.9 on LiveCodeBench) and demonstrates adaptive output behavior, producing longer responses for harder tasks and shorter ones for simpler tasks. These findings highlight the value of high-quality, verified reasoning traces. We release the AM-Thinking-v1 and Qwen3-235B-A22B distilled datasets to support future research on open and high-performing reasoning-oriented language models. The datasets are publicly available on Hugging Face: AM-Thinking-v1-Distilled (https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled) and AM-Qwen3-Distilled (https://huggingface.co/datasets/a-m-team/AM-Qwen3-Distilled).

Xiaoyu Tian, Yunjie Ji, Haotian Wang, Shuaiting Chen, Sitong Zhao, Yiping Peng, Han Zhao, Xiangang Li • 2025

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	HMMT 2025	Accuracy 41.3	241
Mathematical Reasoning	Omni-MATH	Accuracy 64.5	135
Mathematical Reasoning	OlympiadBench Math	Accuracy 77.5	97
Mathematical Reasoning	AIME 2025	Accuracy 54.6	59
Multi-domain language model evaluation	ODA benchmark suite (test)	General Accuracy 65.9	21
General Language Understanding and Reasoning	General domain benchmarks (test)	DROP Score 93.3	16
Mathematical Reasoning	Math domain benchmarks (GSM8K, MATH500, Omni-Math, Olympiad, AIME'24) standard (test)	GSM8K Accuracy 95.2	16
Code Generation	Code domain benchmarks	HumanEval 91.5	16
