What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
About
Reinforcement learning can train LLM agents from sparse task rewards, but long-horizon credit assignment remains challenging: a single success-or-failure signal must be distributed across many actions. Existing methods rely on trajectory-level rewards or proxy signals and do not fully exploit per-step environmental feedback. Multi-turn agent settings, where feedback can include error messages, page changes, observations, or reference trajectories, remain underexplored. We systematically study five feedback sources and two insertion granularities and introduce SERL, a selective environment-reweighted learning framework. SERL uses the task reward to determine the direction of each update, while environment feedback determines where the update is applied and how large it is, concentrating learning on critical actions. On ALFWorld and WebShop, SERL achieves success rates of 90.0% and 80.1%, respectively, outperforming strong RL and distillation baselines. Analysis shows that grounded, action-relevant feedback inserted at meaningful points consistently outperforms indiscriminate use of longer or richer context.
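The division of labor described above (task reward sets the update direction, environment feedback sets placement and magnitude) can be illustrated with a small sketch. This is not the authors' implementation: the function name `reweighted_step_advantages`, the mean-reward `baseline`, the per-step `feedback_scores`, and the normalization rule are all assumptions chosen for illustration.

```python
from typing import List


def reweighted_step_advantages(
    task_reward: float,
    baseline: float,
    feedback_scores: List[float],
) -> List[float]:
    """Hypothetical per-step credit assignment.

    The sign of every step's advantage comes only from the trajectory-level
    task reward (relative to a baseline). Non-negative feedback scores only
    redistribute magnitude across steps; they never flip the sign.
    """
    n = len(feedback_scores)
    if n == 0:
        return []

    # Direction: positive for better-than-baseline trajectories, else negative.
    direction = task_reward - baseline

    total = sum(feedback_scores)
    if total <= 0:
        # No usable feedback: fall back to uniform trajectory-level credit.
        weights = [1.0] * n
    else:
        # Normalize so the weights average to 1, keeping the overall
        # update scale comparable to the uniform case.
        weights = [n * s / total for s in feedback_scores]

    return [direction * w for w in weights]


# A successful three-step episode where feedback marks the last step as critical.
advantages = reweighted_step_advantages(1.0, 0.5, [0.0, 1.0, 3.0])
print(advantages)
```

Under these assumptions, a step with zero feedback score receives no update, and a failed trajectory yields non-positive advantages at every step regardless of the feedback scores.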
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Online Shopping | WebShop (test) | Score | 89.5 | 59 |
| Multi-turn Agent Interaction | ALFWorld (test) | Success Rate (Pick) | 92.3 | 31 |