Embarrassingly Simple Self-Distillation Improves Code Generation

About

Can a large language model (LLM) improve at code generation using only its own raw outputs, without a verifier, a teacher model, or reinforcement learning? We answer in the affirmative with simple self-distillation (SSD): sample solutions from the model with certain temperature and truncation configurations, then fine-tune on those samples with standard supervised fine-tuning. SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6, with gains concentrating on harder problems, and it generalizes across Qwen and Llama models at 4B, 8B, and 30B scale, including both instruct and thinking variants. To understand why such a simple method can work, we trace these gains to a precision-exploration conflict in LLM decoding and show that SSD reshapes token distributions in a context-dependent way, suppressing distractor tails where precision matters while preserving useful diversity where exploration matters. Taken together, SSD offers a complementary post-training direction for improving LLM code generation. Our code is available at https://github.com/apple/ml-ssd
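As a rough illustration of the sampling and data-collection steps the abstract describes, here is a minimal sketch. It assumes top-k and top-p (nucleus) filtering as the truncation scheme, since the abstract only mentions "certain temperature and truncation configurations". The names `truncated_probs`, `sample_token`, `build_ssd_dataset`, and `sample_fn` are hypothetical and are not taken from the authors' repository.

```python
import math
import random
from typing import Callable, Dict, List, Optional


def truncated_probs(logits: List[float], temperature: float = 1.0,
                    top_k: Optional[int] = None,
                    top_p: Optional[float] = None) -> List[float]:
    """Temperature-scaled softmax followed by top-k / top-p truncation.

    Assumption: top-k / top-p stand in for the unspecified "truncation".
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Rank token ids from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:max(1, top_k)]  # always keep at least one token
    if top_p is not None:
        kept, mass = [], 0.0
        for i in order:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:  # smallest prefix whose mass reaches top_p
                break
        order = kept

    # Zero out the truncated tail and renormalize the surviving tokens.
    keep = set(order)
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0
            for i in range(len(probs))]


def sample_token(logits: List[float], rng: random.Random, **kwargs) -> int:
    """Draw one token id from the truncated distribution."""
    probs = truncated_probs(logits, **kwargs)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def build_ssd_dataset(prompts: List[str],
                      sample_fn: Callable[[str], str],
                      samples_per_prompt: int = 1) -> List[Dict[str, str]]:
    """Collect raw model samples as supervised fine-tuning pairs.

    No verifier or filter is applied: every sample is kept as-is.
    `sample_fn` is a hypothetical stand-in for decoding from the model.
    """
    data = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            data.append({"prompt": prompt, "completion": sample_fn(prompt)})
    return data
```

In this sketch, the pairs returned by `build_ssd_dataset` would be passed to an ordinary supervised fine-tuning loop on the same model. Lowering `temperature` or tightening `top_k` / `top_p` removes low-probability tail tokens from the samples, which is one plausible reading of how the sampling configuration shapes what the model is later trained on.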

Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang • 2026

Related benchmarks

Task	Dataset	Result
Science Question Answering	ScienceQA	Accuracy 80.8	916
Question Answering	BBH	--	33
Tool Use	ToolAlpaca	Tool Use Success Rate 55.9	26
Question Answering	MMLU	Answer-letter Accuracy 73.5	20
Code Generation	CodeAlpaca 20k	NLL 0.608	20
Expert-level Science Question Answering	GPQA	Accuracy 33.7	14
Commonsense Reasoning	CoS-E	Accuracy 79.9	14

