Curriculum-RLAIF: Curriculum Alignment with Reinforcement Learning from AI Feedback

About

Reward models trained through Reinforcement Learning from AI Feedback (RLAIF) methods frequently suffer from limited generalizability, which hinders the alignment performance of policy models. This challenge stems from various issues, including distribution shift, preference label noise, and mismatch of overly challenging samples with model capacity. In this paper, we aim to enhance the generalizability of reward models through a data-centric approach, driven by the insight that these issues are inherently intertwined from a uniform perspective of data difficulty. Accordingly, we propose a novel framework, Curriculum-RLAIF, which constructs preference pairs with varying difficulty levels and then produces a specific curriculum for reward model training. Comprehensive experimental results suggest that reward models trained with Curriculum-RLAIF achieve improved generalizability, boosting the alignment performance of policy models by a significant margin without incurring additional inference costs compared to various existing non-curriculum baselines. Further analysis and comparison with alternative strategies highlight the superiority of Curriculum-RLAIF in simplicity, efficiency, and effectiveness.
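The idea of ordering preference pairs by difficulty can be illustrated with a minimal sketch. The abstract does not say how difficulty is measured or how the curriculum is scheduled, so everything here is a hypothetical illustration rather than the authors' method: the `PreferencePair` type, its `difficulty` field, and the `curriculum_stages` helper are all assumed names. The sketch sorts pairs from easy to hard and splits them into contiguous stages that a reward model trainer could consume in order.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    # Hypothetical score (assumption): higher means the pair is harder to rank.
    difficulty: float


def curriculum_stages(
    pairs: List[PreferencePair], num_stages: int
) -> Iterator[List[PreferencePair]]:
    """Yield subsets of preference pairs ordered from easiest to hardest."""
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    ordered = sorted(pairs, key=lambda p: p.difficulty)
    # Split the sorted list into num_stages contiguous, nearly equal chunks;
    # earlier chunks absorb any remainder.
    base, extra = divmod(len(ordered), num_stages)
    start = 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)
        yield ordered[start:start + size]
        start += size
```

A training loop under these assumptions would iterate over `curriculum_stages(pairs, num_stages)` and fit the reward model on each stage in turn, so that easier pairs are seen before harder ones.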

Jiaye Lin, Mengdi Li, Xufeng Zhao, Wenhao Lu, Peilin Zhao, Stefan Wermter, Di Wang • 2025

Related benchmarks

Task	Dataset	Result
Summarization	Summarization	Win Rate: 95	39
Harmlessness	Harmlessness	Average Win Rate: 96	21
Helpfulness	Helpfulness	Average Win Rate: 97	21
Preference Labeling	Anthropic Harmlessness	Preference Labeling Accuracy: 77	8
Preference Labeling	Anthropic Helpfulness	Preference Labeling Accuracy: 81	8
Preference Labeling	Summarization	Preference Labeling Accuracy: 89	8
