Self-Distilled Agentic Reinforcement Learning

About

Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher branch augmented with privileged context. However, transferring OPSD to multi-turn agents proves problematic for two reasons. First, instability compounds across turns and destabilizes the supervision signal. Second, skill-conditioned privileged guidance requires asymmetric treatment, because negative teacher rejections may arise from imperfect skill retrieval or utilization rather than from a genuinely poor student token. We introduce SDAR (Self-Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective while keeping RL as the primary optimization backbone. SDAR maps detached token-level signals into a sigmoid gate, strengthening distillation on teacher-endorsed positive-gap tokens and softly attenuating negative teacher rejections. Across the Qwen2.5 and Qwen3 families on ALFWorld, WebShop, and Search-QA, SDAR substantially improves over GRPO (+9.4% on ALFWorld, +7.0% on Search-QA, +10.2% on WebShop-Acc), avoids the instability of naive GRPO+OPSD, and consistently outperforms hybrid RL-OPSD baselines across model scales.
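The gating idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the exact gate input or loss form, so the sketch assumes the token-level signal is the teacher-minus-student log-probability gap, and the names `gate_from_gap`, `gated_aux_loss`, `temperature`, and `aux_coef` are hypothetical. NumPy has no autograd, so "detached" here only means the gate is treated as a constant weight; in an autodiff framework the gap would be wrapped in a stop-gradient.

```python
import numpy as np

def sigmoid(x):
    # Logistic function; fine for the small gaps used in this sketch.
    return 1.0 / (1.0 + np.exp(-x))

def gate_from_gap(teacher_logp, student_logp, temperature=1.0):
    """Map a per-token log-prob gap to a weight in (0, 1).

    Assumption: gap = teacher_logp - student_logp. A positive gap
    (teacher endorses the token) gives a weight above 0.5; a negative
    gap (teacher rejects it) gives a weight below 0.5, so rejections
    are softly attenuated rather than dropped.
    """
    gap = np.asarray(teacher_logp) - np.asarray(student_logp)
    # Treated as a constant (detached) weight: no gradient flows through it.
    return sigmoid(gap / temperature)

def gated_aux_loss(rl_loss, token_distill_loss, teacher_logp, student_logp,
                   aux_coef=0.1, temperature=1.0):
    """RL loss stays primary; gated distillation is a small auxiliary term."""
    gate = gate_from_gap(teacher_logp, student_logp, temperature)
    aux = float(np.mean(gate * np.asarray(token_distill_loss)))
    return rl_loss + aux_coef * aux

if __name__ == "__main__":
    teacher_logp = [-0.5, -2.0]   # hypothetical per-token values
    student_logp = [-1.5, -1.0]
    print(gate_from_gap(teacher_logp, student_logp))
    print(gated_aux_loss(2.0, [1.0, 1.0], teacher_logp, student_logp))
```

In this toy setup the first token has a positive gap and receives the larger weight, while the second token's distillation term is down-weighted but not zeroed, which is the asymmetry the abstract describes.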

Zhengxi Lu, Zhiyuan Yao, Zhuowen Han, Zi-Han Wang, Jinyang Wu, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen • 2026

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Interactive Decision-making	AlfWorld	--	--	398
Web Navigation and Shopping	Webshop	Score	89.4	248
Embodied Task	AlfWorld	Overall Success Rate	84.4	183
Question Answering	Search-QA	Average Score	49	177
Web Shopping Agent	Webshop	Success Rate (SR)	82.8	72
Interactive Task Completion	AlfWorld	Pick Success Rate	97.1	72
Embodied Task Completion	AlfWorld	Pick Success Rate	97.1	54
Multi-turn Agent Interaction	Webshop	Success Rate	67.97	39
Multi-turn Agent Interaction	ALFWorld ScienceWorld WebShop Average	Success Rate (SR)	64.37	39
E-commerce interaction	WebShop 128 tasks (test)	Score	89.4	33

Showing 10 of 15 rows
