Reflective Prompted Policy Optimization: Trajectory-Grounded Revision and Salience Bias

About

Existing LLM-based policy optimizers see only scalar rewards: they learn that a policy scored 0.45, but not whether the agent got stuck in a loop, fell into a hole on the third step, or performed well on 19 out of 20 rollouts and failed catastrophically on one. We propose Reflective Prompted Policy Optimization (R2PO), a two-stage LLM framework for policy search over compact policy classes that augments scalar reward feedback with trajectory-level behavioral evidence. A Search-LLM proposes candidate policy parameters; the environment executes them; a Critic-LLM inspects the resulting rollouts and proposes targeted revisions grounded in observed states, actions, and rewards. Across ten environments, ablations show that R2PO's gains require separating global search from behavior-grounded revision and using selection to filter high-variance edits. We further identify a dominant failure mode, salience bias: when presented with multiple rollouts, the Critic-LLM fixates on improving a single failure even when most trajectories succeed. In a three-trajectory variant where the Critic-LLM sees the best, worst, and median rollouts, this behavior explains 76.6% of regressions on CartPole. R2PO mitigates this bias by combining reasoning over aggregate rollout statistics with median-trajectory selection and a revision rule. Using a 20B open-weight model, R2PO achieves the highest mean best reward across all ten environments, reaches near-optimal performance substantially earlier (e.g., near-maximum CartPole reward within ~500 episodes), and trains far more stably than both deep RL and prior LLM-based methods. These results show that treating trajectories as first-class in-context evidence, rather than as artifacts reduced to scalar returns, changes how even comparatively small LLMs search over policy spaces, enabling them to learn faster, diagnose more precisely, and reliably improve external controllers.
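The propose, execute, critique, and select loop described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the authors' implementation: the toy 1-D environment, the two-parameter linear policy, and the functions `search_stub` and `critic_stub`, which stand in for the Search-LLM and Critic-LLM calls with simple random and heuristic rules. The selection step (keep a revision only if its mean return improves) is likewise one plausible reading of "using selection to filter high-variance edits"; the abstract does not specify the actual revision rule or prompts.

```python
import random
import statistics
from dataclasses import dataclass
from typing import List, Tuple

Params = List[float]                           # [gain, bias] of a linear policy
Trajectory = List[Tuple[float, float, float]]  # (state, action, reward) per step


@dataclass
class RolloutSummary:
    mean: float
    min: float
    max: float
    median_trajectory: Trajectory


def rollout(params: Params, rng: random.Random, horizon: int = 20) -> Trajectory:
    """Toy 1-D task (hypothetical): steer the state x toward 0."""
    x = rng.uniform(-1.0, 1.0)
    traj: Trajectory = []
    for _ in range(horizon):
        action = params[0] * x + params[1]
        x_next = x + action + rng.gauss(0.0, 0.01)
        traj.append((x, action, -abs(x_next)))
        x = x_next
    return traj


def episode_return(traj: Trajectory) -> float:
    return sum(reward for _, _, reward in traj)


def summarize(trajs: List[Trajectory]) -> RolloutSummary:
    """Aggregate statistics plus the median-return rollout (upper median if even)."""
    returns = [episode_return(t) for t in trajs]
    order = sorted(range(len(trajs)), key=returns.__getitem__)
    return RolloutSummary(
        mean=statistics.fmean(returns),
        min=min(returns),
        max=max(returns),
        median_trajectory=trajs[order[len(order) // 2]],
    )


def search_stub(rng: random.Random) -> Params:
    """Stand-in for the Search-LLM: propose candidate parameters at random."""
    return [rng.uniform(-2.0, 0.0), rng.uniform(-0.5, 0.5)]


def critic_stub(params: Params, summary: RolloutSummary, rng: random.Random) -> Params:
    """Stand-in for the Critic-LLM: revise using the median rollout, not the worst one."""
    last_state = summary.median_trajectory[-1][0]
    nudge = -0.05 if last_state > 0 else 0.05  # push the bias against the residual offset
    return [params[0] + rng.gauss(0.0, 0.1), params[1] + nudge]


def r2po_sketch(iterations: int = 20, n_rollouts: int = 8, seed: int = 0):
    rng = random.Random(seed)
    best_params: Params = []
    best_score = float("-inf")
    history: List[float] = []
    for _ in range(iterations):
        candidate = search_stub(rng)
        before = summarize([rollout(candidate, rng) for _ in range(n_rollouts)])
        revised = critic_stub(candidate, before, rng)
        after = summarize([rollout(revised, rng) for _ in range(n_rollouts)])
        # Selection (assumed rule): accept the revision only if mean return improves.
        if after.mean > before.mean:
            chosen, score = revised, after.mean
        else:
            chosen, score = candidate, before.mean
        if score > best_score:
            best_params, best_score = chosen, score
        history.append(best_score)
    return best_params, best_score, history
```

Calling `best_params, best_score, history = r2po_sketch()` returns the best parameters found, their mean return, and the running best score per iteration, which is non-decreasing by construction. Passing the critic a summary built from all rollouts, together with the median trajectory, is meant to mirror the stated mitigation for salience bias: the revision is driven by typical behavior, so a single outlier failure cannot dominate it.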

Rahaf Abu Hara, Vaibbhav Murarri, Claudio Zito • 2026

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Reinforcement Learning	MountainCarContinuous v0	Average Agent Reward	98.75	65
Reinforcement Learning	cartpole	Average Reward	474.7	29
Reinforcement Learning	MountainCar	Avg Episode Reward	147.8	25
Reinforcement Learning	Swimmer	Average Returns	260.4	24
Reinforcement Learning	Inverted Double Pendulum	Avg Episode Reward	254	18
Reinforcement Learning	MountainCar	Maximum Return	111	14
Reinforcement Learning	cartpole	Max Return	500	13
Reinforcement Learning	Maze Gymnasium	Mean Best Reward	0.97	12
Reinforcement Learning	FrozenLake	Reward	0.62	12
Reinforcement Learning	InvertedPendulum	Mean Reward	1.00e+3	8

Showing 10 of 24 rows
