Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning

About

Reinforcement learning for multi-turn agents suffers from a credit-assignment mismatch: rewards are sparse and trajectory-level, while success often hinges on a few local decisions. Existing online policy distillation (OPD) provides denser token-level supervision, but typically treats heterogeneous agent trajectories as monolithic strings rather than causal interaction units. We present StepOPSD, a post-rollout preference self-distillation framework that takes the agent step as the unit of credit redistribution. StepOPSD decomposes trajectories into action-centered step segments, rescoring them under hindsight-enriched teacher contexts and converting token-level log-probability gaps into sign-preserving advantage shaping with a normalized per-step credit budget before the GRPO update. Across ALFWorld and Search-QA with Qwen3-1.7B and Qwen2.5-3B-Instruct, StepOPSD attains best or second-best results on subsets most sensitive to local causal errors, including first-place performance on ALFWorld Heat (79.1%), PickTwo (95.0%), Search-QA TriviaQA (61.6%), and tied-best performance on HotpotQA (40.4%). The results further reveal a consistent two-knob law: smaller {\alpha}_clip acts as a broadly stabilizing local trust region, whereas the optimal global mixing strength {\lambda}_mix remains task-dependent. These findings suggest that step-aware distillation is most useful when trajectory-level rewards are weakly aligned with the local action that determines downstream success.

Yanfei Zhang, Xu Lin, Chenglin Wu• 2026

Related benchmarks

TaskDatasetResultRank
Embodied TaskAlfWorld
Overall Success Rate83.6
169
Question AnsweringSearch-QA
Average Score45.7
130
Showing 2 of 2 rows

Other info

Follow for update