
Reflective Prompted Policy Optimization: Trajectory-Grounded Revision and Salience Bias

About

Existing LLM-based policy optimizers see only scalar rewards: that a policy scored 0.45, but not whether the agent got stuck in a loop, fell into a hole on the third step, or performed well on 19 out of 20 rollouts and failed catastrophically on one. We propose Reflective Prompted Policy Optimization (R2PO), a two-stage LLM framework for policy search over compact policy classes that augments scalar reward feedback with trajectory-level behavioral evidence. A Search-LLM proposes candidate policy parameters; the environment executes them; a Critic-LLM inspects the resulting rollouts and proposes targeted revisions grounded in observed states, actions, and rewards. Across ten environments, ablations show R2PO's gains require separating global search from behavior-grounded revision and using selection to filter high-variance edits. We further identify a dominant failure mode, salience bias: when presented with multiple rollouts, the Critic-LLM fixates on improving a single failure even when most trajectories succeed. In a three-trajectory variant where the Critic-LLM sees the best, worst, and median rollout, this behavior explains 76.6% of regressions on CartPole. R2PO mitigates this by reasoning over aggregate rollout statistics, median-trajectory selection, and a revision rule. Using a 20B open-weight model, R2PO achieves the highest mean best reward across all ten environments, reaches near-optimal performance substantially earlier (e.g., near-maximum CartPole reward within ~500 episodes), and trains far more stably than both deep RL and prior LLM-based methods. These results show that treating trajectories as first-class in-context evidence, rather than artifacts reduced to scalar returns, changes how even comparatively small LLMs search over policy spaces, enabling them to learn faster, diagnose more precisely, and reliably improve external controllers.
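The propose, execute, revise, and select loop described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the two LLMs are replaced by simple stand-in functions (`search_propose` and `critic_revise`), the environment is a hypothetical 1-D toy task with a single gain parameter, and the revision guard is an invented placeholder for the paper's unspecified revision rule. The sketch only shows the control flow: the critic reads the median trajectory and aggregate statistics rather than the worst rollout, and a revision is kept only if it scores better.

```python
import random
import statistics
from typing import Dict, List, Tuple

Step = Tuple[float, float, float]  # (state, action, reward)

def rollout(gain: float, rng: random.Random, horizon: int = 15) -> List[Step]:
    """Hypothetical toy task: drive a 1-D state to zero with action = -gain * state."""
    x = rng.uniform(-1.0, 1.0)
    steps: List[Step] = []
    for _ in range(horizon):
        action = -gain * x
        steps.append((x, action, -abs(x)))  # reward penalizes distance from zero
        x = x + action + rng.gauss(0.0, 0.01)
    return steps

def evaluate(gain: float, rng: random.Random, n: int = 20) -> Tuple[Dict[str, float], List[Step]]:
    """Run n rollouts; return aggregate return statistics and the median-return trajectory."""
    trajs = [rollout(gain, rng) for _ in range(n)]
    returns = [sum(r for _, _, r in t) for t in trajs]
    order = sorted(range(n), key=lambda i: returns[i])
    stats = {"mean": statistics.mean(returns), "median": statistics.median(returns),
             "min": min(returns), "max": max(returns)}
    return stats, trajs[order[n // 2]]

def search_propose(best_gain: float, rng: random.Random) -> float:
    """Stand-in for the Search-LLM: a random perturbation of the best known parameter."""
    return best_gain + rng.uniform(-0.5, 0.5)

def critic_revise(gain: float, stats: Dict[str, float], median_traj: List[Step]) -> float:
    """Stand-in for the Critic-LLM: a revision grounded in the median trajectory."""
    spread = stats["max"] - stats["min"]
    # Hypothetical guard: if the median rollout is already close to the best one,
    # leave the policy alone instead of chasing the single worst rollout.
    if spread > 0 and (stats["max"] - stats["median"]) / spread < 0.1:
        return gain
    states = [s for s, _, _ in median_traj]
    big_flips = sum(1 for a, b in zip(states, states[1:]) if a * b < 0 and abs(a) > 0.05)
    if big_flips >= 2:                            # state overshoots zero repeatedly
        return gain * 0.8
    if abs(states[3]) > 0.25 * abs(states[0]):    # state decays too slowly
        return gain * 1.2
    return gain

def r2po_sketch(iterations: int = 30, seed: int = 0) -> Tuple[float, float]:
    rng = random.Random(seed)
    best_gain = 0.1
    best_stats, _ = evaluate(best_gain, rng)
    for _ in range(iterations):
        candidate = search_propose(best_gain, rng)
        cand_stats, median_traj = evaluate(candidate, rng)
        revised = critic_revise(candidate, cand_stats, median_traj)
        rev_stats, _ = evaluate(revised, rng) if revised != candidate else (cand_stats, None)
        # Selection: a proposal or revision is kept only if it beats the incumbent.
        for gain, stats in ((candidate, cand_stats), (revised, rev_stats)):
            if stats["mean"] > best_stats["mean"]:
                best_gain, best_stats = gain, stats
    return best_gain, best_stats["mean"]

if __name__ == "__main__":
    print(r2po_sketch())
```

Because the selection step never replaces the incumbent with a lower-scoring candidate, the best mean return in this sketch cannot decrease across iterations, which is the role the abstract assigns to selection when it filters high-variance edits.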

Rahaf Abu Hara, Vaibbhav Murarri, Claudio Zito • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reinforcement Learning | MountainCarContinuous v0 | Average Agent Reward | 98.75 | 65 |
| Reinforcement Learning | cartpole | Average Reward | 474.7 | 29 |
| Reinforcement Learning | Swimmer | Average Returns | 260.4 | 24 |
| Reinforcement Learning | MountainCar | Avg Episode Reward | 147.8 | 18 |
| Reinforcement Learning | Inverted Double Pendulum | Avg Episode Reward | 254 | 18 |
| Reinforcement Learning | MountainCar | Maximum Return | 111 | 14 |
| Reinforcement Learning | FrozenLake | Reward | 0.62 | 12 |
| Reinforcement Learning | cartpole | Max Return | 500 | 9 |
| Reinforcement Learning | InvertedPendulum | Mean Reward | 1.00e+3 | 8 |
| Reinforcement Learning | Maze | Mean Reward | 0.97 | 8 |
Showing 10 of 24 rows
