Self-Distilled Agentic Reinforcement Learning
About
Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher branch augmented with privileged context. However, transferring OPSD to multi-turn agents proves problematic: compounding multi-turn instability destabilizes supervision, while skill-conditioned privileged guidance requires asymmetric treatment, because negative teacher rejections may arise from imperfect skill retrieval or utilization. We introduce SDAR (Self-Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective while keeping RL as the primary optimization backbone. SDAR maps detached token-level signals into a sigmoid gate, strengthening distillation on teacher-endorsed positive-gap tokens and softly attenuating negative teacher rejections. Across the Qwen2.5 and Qwen3 families on ALFWorld, WebShop, and Search-QA, SDAR substantially improves over GRPO (+9.4% on ALFWorld, +7.0% on Search-QA, +10.2% on WebShop-Acc), avoids the instability of naive GRPO+OPSD, and consistently outperforms hybrid RL-OPSD baselines across model scales.
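The gating idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not define the token-level signal, so the gap used here (teacher log-probability minus student log-probability on the sampled token), the `temperature` parameter, the `aux_weight` coefficient, and all function names are assumptions chosen for illustration only.

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_distill_weights(teacher_logps, student_logps, temperature=1.0):
    """Map a per-token gap to a gate in (0, 1).

    Assumption: gap = teacher log-prob - student log-prob for the sampled
    token. In a real training loop the gate would be computed from detached
    values, so that no gradient flows through it.
    """
    gates = []
    for t_lp, s_lp in zip(teacher_logps, student_logps):
        gap = t_lp - s_lp
        # Positive gap (teacher endorses the token): gate above 0.5.
        # Negative gap (teacher rejects the token): gate shrinks toward 0,
        # which softly attenuates the distillation term rather than
        # dropping it outright.
        gates.append(sigmoid(gap / temperature))
    return gates


def combined_loss(rl_loss, token_distill_losses, gates, aux_weight=0.1):
    """RL loss stays primary; gated distillation is a weighted auxiliary term."""
    gated = [g * l for g, l in zip(gates, token_distill_losses)]
    return rl_loss + aux_weight * sum(gated) / len(gated)


if __name__ == "__main__":
    teacher = [-0.2, -1.5, -3.0]   # hypothetical per-token log-probs
    student = [-1.0, -1.5, -0.5]
    gates = gated_distill_weights(teacher, student)
    print(gates)
    print(combined_loss(1.0, [0.8, 0.0, 2.5], gates))
```

In this sketch a smaller `temperature` makes the gate closer to a hard switch, while a larger one flattens it toward a uniform weight of 0.5; how SDAR actually sets these quantities is not stated in the abstract.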
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Embodied Task | AlfWorld | Overall Success Rate | 84.4 | 169 |
| Question Answering | Search-QA | Average Score | 49 | 130 |
| Web Shopping Agent | Webshop | Score | 89.4 | 53 |
| Online shopping agent navigation | WebShop 128 (val) | Score | 89.4 | 30 |
| Text-based embodied AI | AlfWorld | Pick Success | 97.1 | 30 |
| Embodied Task Completion | AlfWorld | Pick Success Rate | 94.7 | 21 |