
Value-Free Policy Optimization via Reward Partitioning

About

Single-trajectory preference optimization methods learn from datasets of (prompt, response, reward) tuples, offering a practical alternative to pairwise preference learning by directly leveraging scalar feedback. Existing approaches such as Direct Reward Optimization (DRO) have demonstrated promising results but rely on value function estimation, introducing additional variance, optimization complexity, and sensitivity to off-policy data. We introduce Reward Partition Optimization (RPO), a simple and scalable reward-driven objective that eliminates the need for value function learning. RPO normalizes rewards through a partition-based formulation estimated directly from prompt-level reward distributions, yielding a stable supervised optimization objective without auxiliary models or reinforcement learning loops. We evaluate RPO across multiple encoder-decoder and decoder-only language models using automatic metrics, LLM-as-a-judge evaluations, and optimization stability analyses. Experimental results show that RPO consistently outperforms strong baselines, including SFT, KTO, and DRO, while producing more aligned, diverse, and less toxic generations.
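
As a rough illustration of what normalizing rewards against a prompt-level partition could look like, the Python sketch below groups (prompt, response, reward) tuples by prompt and divides each exponentiated reward by the sum over that prompt's samples. This is only one plausible reading of the abstract, not the authors' objective: the function name `partition_normalized_weights`, the temperature `beta`, and the use of a plain empirical softmax are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def partition_normalized_weights(samples, beta=1.0):
    """Return one weight per sample, normalized within each prompt.

    samples: list of (prompt, response, reward) tuples.
    beta: hypothetical temperature; smaller values sharpen the weights.

    Assumption: the per-prompt partition is estimated empirically as
    Z(x) = sum_i exp(r_i / beta) over the rewards observed for prompt x.
    """
    # Collect the rewards observed for each prompt, remembering positions.
    by_prompt = defaultdict(list)
    for idx, (prompt, _response, reward) in enumerate(samples):
        by_prompt[prompt].append((idx, reward))

    weights = [0.0] * len(samples)
    for items in by_prompt.values():
        scaled = [reward / beta for _, reward in items]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(scaled)
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scaled))
        for (idx, _), s in zip(items, scaled):
            weights[idx] = math.exp(s - log_z)
    return weights
```

Under this reading, the resulting weights could scale a per-sample supervised loss, so no value network is involved; how the paper actually combines the partition with the policy log-likelihood is not specified in the abstract.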

Bilal Faye, Hanane Azzag, Mustapha Lebbah • 2025

Related benchmarks

Task                     Dataset              Metric     Result  Rank
Instruction Following    AlpacaEval           Win Rate   79.5    420
Instruction Following    IFEval               Win Rate   81.2    36
Multi-turn conversation  MT-Bench             Win Rate   77.1    36
Response Generation      UltraFeedback (val)  BERTScore  88.1    24