SOD: Step-wise On-policy Distillation for Small Language Model Agents

About

Tool-integrated reasoning (TIR) is difficult to scale to small language models due to instability in long-horizon tool interactions and limited model capacity. Reinforcement learning methods such as group relative policy optimization provide only sparse outcome-level rewards. Recently, on-policy distillation (OPD) has gained popularity by supplying dense token-level supervision from a teacher on student-generated trajectories. However, our experiments indicate that applying OPD to TIR leads to a critical failure mode: erroneous tool calls tend to cascade across subsequent reasoning steps, progressively amplifying student-teacher divergence and rendering the teacher's token-level supervision increasingly unreliable. To address this, we propose SOD, a step-wise on-policy distillation framework for small language model agents, which adaptively reweights distillation strength at each step based on step-level divergence. SOD can thus attenuate potentially misleading teacher signals in high-divergence regions while preserving dense guidance in well-aligned states. Experiments on challenging math, science, and code benchmarks show that SOD achieves up to 20.86% improvement over the second-best baseline. Notably, our 0.6B student achieves 26.13% on AIME 2025, demonstrating effective transfer of agentic reasoning to lightweight models. Our code is available at https://github.com/YoungZ365/SOD.
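The step-wise reweighting idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the weighting function, so the exponential decay `exp(-d / tau)`, the temperature `tau`, the use of a per-step mean, and all function names below are assumptions made purely for illustration. The inputs are per-token student-teacher divergences (for example, KL values) and the index of the reasoning step each token belongs to.

```python
import math
from typing import List, Sequence


def step_divergences(token_div: Sequence[float], step_ids: Sequence[int]) -> List[float]:
    """Mean per-token divergence within each step (steps numbered 0..K-1)."""
    n_steps = max(step_ids) + 1
    totals = [0.0] * n_steps
    counts = [0] * n_steps
    for d, k in zip(token_div, step_ids):
        totals[k] += d
        counts[k] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def step_weights(step_div: Sequence[float], tau: float = 1.0) -> List[float]:
    """Assumed weighting: 1.0 at zero divergence, decaying toward 0 as divergence grows."""
    return [math.exp(-d / tau) for d in step_div]


def reweighted_distill_loss(
    token_div: Sequence[float], step_ids: Sequence[int], tau: float = 1.0
) -> float:
    """Token-level distillation loss with each token scaled by its step's weight.

    In a real training loop the weights would typically be treated as
    constants (no gradient flows through them); that is also an assumption.
    """
    weights = step_weights(step_divergences(token_div, step_ids), tau)
    weighted = sum(weights[k] * d for d, k in zip(token_div, step_ids))
    return weighted / len(token_div)


# Toy trajectory: step 0 is well aligned with the teacher, while step 1
# (e.g. after an erroneous tool call) shows high divergence.
token_div = [0.1, 0.1, 2.0, 2.0]
step_ids = [0, 0, 1, 1]

weights = step_weights(step_divergences(token_div, step_ids))
uniform_loss = sum(token_div) / len(token_div)
sod_like_loss = reweighted_distill_loss(token_div, step_ids)
```

In this toy case the high-divergence step receives a smaller weight than the well-aligned step, so its tokens contribute less to the loss than they would under uniform token-level distillation.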

Qiyong Zhong, Mao Zheng, Mingyang Song, Xin Lin, Jie Sun, Houcheng Jiang, Xiang Wang, Junfeng Fang • 2026

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	AIME 2024	Average Score (avg@32): 50.83	53
Multi-turn Agent Interaction	ALFWorld ScienceWorld WebShop Average	Success Rate (SR): 68.28	39
Multi-turn Agent Interaction	WebShop	Success Rate: 80.47	39
Multi-turn agent task	Search-QA	Success Rate: 48.58	31
Coding	LiveCodeBench	Acc (avg@32): 40.63	29
Multi-turn agent task	ALFWorld	Success Rate: 79.69	23
Mathematical Reasoning	AIME 2025	Average Score (32 samples): 41.72	15
Scientific Reasoning	GPQA Diamond	Average@32: 38.72	15

