Value-Free Policy Optimization via Reward Partitioning
About
Single-trajectory preference optimization methods learn from datasets of ((prompt, response, reward)) tuples, offering a practical alternative to pairwise preference learning by directly leveraging scalar feedback. Existing approaches such as Direct Reward Optimization (DRO) have demonstrated promising results but rely on value function estimation, introducing additional variance, optimization complexity, and sensitivity to off-policy data. We introduce Reward Partition Optimization (RPO), a simple and scalable reward-driven objective that eliminates the need for value function learning. RPO normalizes rewards through a partition-based formulation estimated directly from prompt-level reward distributions, yielding a stable supervised optimization objective without auxiliary models or reinforcement learning loops. We evaluate RPO across multiple encoder-decoder and decoder-only language models using automatic metrics, LLM-as-a-judge evaluations, and optimization stability analyses. Experimental results show that RPO consistently outperforms strong baselines, including SFT, KTO, and DRO, while producing more aligned, diverse, and less toxic generations.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Instruction Following | AlpacaEval | Win Rate79.5 | 420 | |
| Instruction Following | IFEval | Win Rate81.2 | 36 | |
| Multi-turn conversation | MT-Bench | Win Rate77.1 | 36 | |
| Response Generation | UltraFeedback (val) | BERTScore88.1 | 24 |