Value-Free Policy Optimization via Reward Partitioning

About

Single-trajectory preference optimization methods learn from datasets of (prompt, response, reward) tuples, offering a practical alternative to pairwise preference learning by directly leveraging scalar feedback. Existing approaches such as Direct Reward Optimization (DRO) have demonstrated promising results but rely on value function estimation, introducing additional variance, optimization complexity, and sensitivity to off-policy data. We introduce Reward Partition Optimization (RPO), a simple and scalable reward-driven objective that eliminates the need for value function learning. RPO normalizes rewards through a partition-based formulation estimated directly from prompt-level reward distributions, yielding a stable supervised optimization objective without auxiliary models or reinforcement learning loops. We evaluate RPO across multiple encoder-decoder and decoder-only language models using automatic metrics, LLM-as-a-judge evaluations, and optimization stability analyses. Experimental results show that RPO consistently outperforms strong baselines, including SFT, KTO, and DRO, while producing more aligned, diverse, and less toxic generations.
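The abstract does not spell out the RPO objective, so the following is only an illustrative sketch of what "normalizing rewards through a partition estimated from prompt-level reward distributions" could look like. It groups (prompt, response, reward) tuples by prompt, estimates a per-prompt partition as the sum of exp(reward / beta) over the responses observed for that prompt, and turns each reward into a normalized weight. The function names, the temperature `beta`, and the weighted log-likelihood loss are assumptions made for illustration, not the paper's formulation.

```python
import math
from collections import defaultdict


def partition_weights(samples, beta=1.0):
    """Return one weight per sample, normalized within each prompt.

    samples: list of (prompt, response, reward) tuples.
    beta: hypothetical temperature (an assumption, not from the source).
    """
    # Group sample indices and rewards by prompt.
    by_prompt = defaultdict(list)
    for index, (prompt, _response, reward) in enumerate(samples):
        by_prompt[prompt].append((index, reward))

    weights = [0.0] * len(samples)
    for items in by_prompt.values():
        # Subtract the max reward before exponentiating for numerical stability.
        max_reward = max(reward for _, reward in items)
        exps = [(i, math.exp((reward - max_reward) / beta)) for i, reward in items]
        # Empirical partition estimate for this prompt.
        partition = sum(value for _, value in exps)
        for i, value in exps:
            weights[i] = value / partition
    return weights


def weighted_nll(weights, log_probs):
    """Supervised loss sketch: negative reward-weighted log-likelihood."""
    return -sum(w * lp for w, lp in zip(weights, log_probs))


# Toy data: two responses for one prompt, one response for another.
samples = [
    ("p1", "good answer", 1.0),
    ("p1", "weak answer", 0.0),
    ("p2", "only answer", 5.0),
]
weights = partition_weights(samples, beta=1.0)
```

In this sketch no value function or auxiliary model is involved: the only per-prompt statistic is the partition computed from the rewards in the dataset, and the resulting weights sum to 1 within each prompt.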

Bilal Faye, Hanane Azzag, Mustapha Lebbah • 2025

Related benchmarks

Task	Dataset	Result
Instruction Following	AlpacaEval	Win Rate 79.5	423
Instruction Following	IFEval	Win Rate 81.2	36
Multi-turn conversation	MT-Bench	Win Rate 77.1	36
Response Generation	UltraFeedback (val)	BERTScore 88.1	24
