SOD: Step-wise On-policy Distillation for Small Language Model Agents

About

Tool-integrated reasoning (TIR) is difficult to scale to small language models due to instability in long-horizon tool interactions and limited model capacity. Reinforcement learning methods such as group relative policy optimization (GRPO) provide only sparse outcome-level rewards. Recently, on-policy distillation (OPD) has gained popularity by supplying dense token-level supervision from a teacher on student-generated trajectories. However, our experiments indicate that applying OPD to TIR leads to a critical failure mode: erroneous tool calls tend to cascade across subsequent reasoning steps, progressively amplifying student-teacher divergence and rendering the teacher's token-level supervision increasingly unreliable. To address this, we propose SOD, a step-wise on-policy distillation framework for small language model agents, which adaptively reweights distillation strength at each step based on step-level divergence. As a result, SOD can attenuate potentially misleading teacher signals in high-divergence regions while preserving dense guidance in well-aligned states. Experiments on challenging math, science, and code benchmarks show that SOD achieves up to 20.86% improvement over the second-best baseline. Notably, our 0.6B student achieves 26.13% on AIME 2025, demonstrating effective transfer of agentic reasoning to lightweight models. Our code is available at https://github.com/YoungZ365/SOD.

Qiyong Zhong, Mao Zheng, Mingyang Song, Xin Lin, Jie Sun, Houcheng Jiang, Xiang Wang, Junfeng Fang • 2026
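
The step-wise reweighting idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the divergence measure or the weighting function, so the choice of KL(student || teacher), the exponential weight `exp(-d / temperature)`, and the names `token_kl`, `step_weight`, and `stepwise_distill_loss` are all assumptions made for illustration only.

```python
import math


def token_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for one token's distribution over the vocabulary."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0
    )


def step_weight(divergence, temperature=1.0):
    """Hypothetical weighting rule: weight shrinks as step-level divergence grows."""
    return math.exp(-divergence / temperature)


def stepwise_distill_loss(steps, temperature=1.0):
    """Weighted distillation loss over a trajectory.

    `steps` is a list of reasoning/tool-call steps; each step is a list of
    (student_probs, teacher_probs) pairs, one pair per generated token.
    In real training the weight would be treated as a constant (no gradient).
    """
    total, n_tokens = 0.0, 0
    for step in steps:
        kls = [token_kl(s, t) for s, t in step]
        if not kls:
            continue  # skip empty steps
        divergence = sum(kls) / len(kls)           # step-level divergence (assumed: mean token KL)
        weight = step_weight(divergence, temperature)
        total += weight * sum(kls)                 # down-weight tokens in high-divergence steps
        n_tokens += len(kls)
    return total / n_tokens if n_tokens else 0.0


# Toy trajectory: one well-aligned step, one step where student and teacher disagree.
aligned_step = [([0.5, 0.5], [0.5, 0.5])]
divergent_step = [([0.9, 0.1], [0.1, 0.9])]
loss = stepwise_distill_loss([aligned_step, divergent_step])
```

Under these assumptions, the well-aligned step keeps a weight of 1, while the divergent step contributes less to the loss than it would under uniform token-level distillation.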

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | AIME 2024 | Average Score (avg@32) | 50.83 | 41 |
| Coding | LiveCodeBench | Acc (avg@32) | 40.63 | 29 |
| Mathematical Reasoning | AIME 2025 | Average Score (32 samples) | 41.72 | 15 |
| Scientific Reasoning | GPQA Diamond | Average@32 | 38.72 | 15 |
