Polychromic Objectives for Reinforcement Learning

About

Reinforcement learning fine-tuning (RLFT) is a dominant paradigm for improving pretrained policies for downstream tasks. These pretrained policies, trained on large datasets, produce generations with a broad range of promising but unrefined behaviors. Often, a critical failure mode of RLFT arises when policies lose this diversity and collapse into a handful of easily exploitable outputs. This convergence hinders exploration, which is essential for expanding the capabilities of the pretrained policy and for amplifying the benefits of test-time compute scaling. To address this, we introduce an objective for policy gradient methods that explicitly enforces the exploration and refinement of diverse generations, which we call a polychromic objective. We then show how proximal policy optimization (PPO) can be adapted to optimize this objective. Our method (1) employs vine sampling to collect on-policy rollouts and (2) modifies the advantage function to reflect the advantage under our new objective. Experiments on BabyAI, Minigrid, and Algorithmic Creativity show that our method improves success rates by reliably solving a larger set of environment configurations and generalizes better under large perturbations. Moreover, when given multiple attempts in pass@$k$ experiments, the policy achieves substantially higher coverage, demonstrating its ability to maintain and exploit a diverse repertoire of strategies.

Jubayer Ibn Hamid, Ifdita Hasan Orney, Ellen Xu, Chelsea Finn, Dorsa Sadigh • 2025

Related benchmarks

Task	Dataset	Result
Instruction Following	BabyAI BossLevel	Success Rate: 46.8	14
Bosslevel	BabyAI	Average Pass Rate: 0.343	7
Four Rooms	MiniGrid	Average Pass Rate: 88.7	7
Goto	BabyAI	Average Pass Rate: 0.606	7
Instruction Following	BabyAI Goto	Average Episodic Reward: 0.575	7
Instruction Following	BabyAI Pickup	Average Episodic Reward: 0.486	7
Pickup	BabyAI	Average Pass Rate: 33.4	7
Synthseq	BabyAI	Average Pass Rate: 32.1	7
Instruction Following	BabyAI Synthseq	Average Episodic Reward: 0.341	7
Navigation	MiniGrid Four Rooms	Average Episodic Reward: 0.666	7
