Curriculum-RLAIF: Curriculum Alignment with Reinforcement Learning from AI Feedback

About

Reward models trained through Reinforcement Learning from AI Feedback (RLAIF) methods frequently suffer from limited generalizability, which hinders the alignment performance of policy models. This challenge stems from various issues, including distribution shift, preference label noise, and mismatch of overly challenging samples with model capacity. In this paper, we aim to enhance the generalizability of reward models through a data-centric approach, driven by the insight that these issues are inherently intertwined from a uniform perspective of data difficulty. Accordingly, we propose a novel framework, Curriculum-RLAIF, which constructs preference pairs with varying difficulty levels and then produces a specific curriculum for reward model training. Comprehensive experimental results suggest that reward models trained with Curriculum-RLAIF achieve improved generalizability, boosting the alignment performance of policy models by a significant margin without incurring additional inference costs compared to various existing non-curriculum baselines. Further analysis and comparison with alternative strategies highlight the superiority of Curriculum-RLAIF in simplicity, efficiency, and effectiveness.

Jiaye Lin, Mengdi Li, Xufeng Zhao, Wenhao Lu, Peilin Zhao, Stefan Wermter, Di Wang • 2025
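
The abstract describes building preference pairs of varying difficulty and arranging them into a curriculum for reward model training. The sketch below illustrates only that general easy-to-hard ordering idea. It is not the authors' implementation: the `PreferencePair` type, the `difficulty` score, and the `build_curriculum` helper are hypothetical names introduced for illustration, and the abstract does not say how difficulty is measured or how stages are scheduled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    # Hypothetical difficulty score; higher means harder.
    # How Curriculum-RLAIF actually scores difficulty is not stated in the abstract.
    difficulty: float


def build_curriculum(pairs: List[PreferencePair], num_stages: int) -> List[List[PreferencePair]]:
    """Sort pairs from easy to hard and split them into near-equal stages."""
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    ordered = sorted(pairs, key=lambda p: p.difficulty)
    stage_size, remainder = divmod(len(ordered), num_stages)
    stages: List[List[PreferencePair]] = []
    start = 0
    for i in range(num_stages):
        # Earlier stages absorb the remainder, one extra pair each.
        end = start + stage_size + (1 if i < remainder else 0)
        stages.append(ordered[start:end])
        start = end
    return stages


# Toy data, made up for illustration only.
pairs = [
    PreferencePair("p1", "a", "b", difficulty=0.9),
    PreferencePair("p2", "a", "b", difficulty=0.1),
    PreferencePair("p3", "a", "b", difficulty=0.5),
    PreferencePair("p4", "a", "b", difficulty=0.3),
    PreferencePair("p5", "a", "b", difficulty=0.7),
]
stages = build_curriculum(pairs, num_stages=2)
```

Under these assumptions, a reward model would be trained on `stages[0]` first and on later, harder stages afterwards. A real pipeline would also need a concrete difficulty measure and a training loop, both of which are outside what the abstract specifies.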

Related benchmarks

Task                 Dataset                  Result                             Rank
Summarization        Summarization            Win Rate: 95                       39
Harmlessness         Harmlessness             Average Win Rate: 96               21
Helpfulness          Helpfulness              Average Win Rate: 97               21
Preference Labeling  Anthropic Harmlessness   Preference Labeling Accuracy: 77   8
Preference Labeling  Anthropic Helpfulness    Preference Labeling Accuracy: 81   8
Preference Labeling  Summarization            Preference Labeling Accuracy: 89   8