Not All Correct Answers Are Equal: Why Your Distillation Source Matters

About

Distillation has emerged as a practical and effective approach to enhance the reasoning capabilities of open-source language models. In this work, we conduct a large-scale empirical study on reasoning data distillation by collecting verified outputs from three state-of-the-art teacher models (AM-Thinking-v1, Qwen3-235B-A22B, and DeepSeek-R1) on a shared corpus of 1.89 million queries. We construct three parallel datasets and analyze their distributions, revealing that AM-Thinking-v1-distilled data exhibits greater token length diversity and lower perplexity. Student models trained on each dataset are evaluated on reasoning benchmarks including AIME2024, AIME2025, MATH500, and LiveCodeBench. The model distilled from AM-Thinking-v1 consistently achieves the best performance (e.g., 84.3 on AIME2024, 72.2 on AIME2025, 98.4 on MATH500, and 65.9 on LiveCodeBench) and demonstrates adaptive output behavior, producing longer responses for harder tasks and shorter ones for simpler tasks. These findings highlight the value of high-quality, verified reasoning traces. We release the AM-Thinking-v1 and Qwen3-235B-A22B distilled datasets to support future research on open and high-performing reasoning-oriented language models. The datasets are publicly available on Hugging Face: AM-Thinking-v1-Distilled (https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled) and AM-Qwen3-Distilled (https://huggingface.co/datasets/a-m-team/AM-Qwen3-Distilled).
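
The two distribution measures named in the abstract, token length diversity and perplexity, can be illustrated with a minimal sketch. This is not the authors' analysis code. The function names are hypothetical, and the inputs (per-response token counts and per-token natural-log probabilities from some scoring model) are assumed to have been computed beforehand. Standard deviation is used here as one possible proxy for length diversity.

```python
import math
from statistics import mean, pstdev


def length_stats(token_counts):
    """Summarize response lengths (hypothetical helper).

    token_counts: list of token counts, one per distilled response.
    The population standard deviation is one simple proxy for diversity.
    """
    return {
        "mean": mean(token_counts),
        "std": pstdev(token_counts),
        "min": min(token_counts),
        "max": max(token_counts),
    }


def perplexity(token_logprobs):
    """Perplexity of one response from per-token natural-log probabilities.

    Computed as exp of the negative mean log-probability.
    """
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

Under these assumptions, a dataset whose responses range from very short to very long yields a larger `std`, and a response that the scoring model finds more predictable yields a lower `perplexity`.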

Xiaoyu Tian, Yunjie Ji, Haotian Wang, Shuaiting Chen, Sitong Zhao, Yiping Peng, Han Zhao, Xiangang Li • 2025

Related benchmarks

Task | Dataset | Result | Rank
Mathematical Reasoning | OlympiadBench Math | Accuracy 77.5 | 84
Mathematical Reasoning | Omni-MATH | Accuracy 64.5 | 68
Mathematical Reasoning | HMMT 2025 | Accuracy 41.3 | 38
Mathematical Reasoning | AIME 2025 | Accuracy 54.6 | 37
Multi-domain language model evaluation | ODA benchmark suite (test) | General Accuracy 65.9 | 21
General Language Understanding and Reasoning | General domain benchmarks (test) | DROP Score 93.3 | 16
Mathematical Reasoning | Math domain benchmarks (GSM8K, MATH500, Omni-Math, Olympiad, AIME'24) standard (test) | GSM8K Accuracy 95.2 | 16
Code Generation | Code domain benchmarks | HumanEval 91.5 | 16