Semi-Supervised Reward Modeling via Iterative Self-Training

About

Reward models (RM) capture the values and preferences of humans and play a central role in Reinforcement Learning with Human Feedback (RLHF) to align pretrained large language models (LLMs). Traditionally, training these models relies on extensive human-annotated preference data, which poses significant challenges in terms of scalability and cost. To overcome these limitations, we propose Semi-Supervised Reward Modeling (SSRM), an approach that enhances RM training using unlabeled data. Given an unlabeled dataset, SSRM involves three key iterative steps: pseudo-labeling unlabeled examples, selecting high-confidence examples through a confidence threshold, and supervised finetuning on the refined dataset. Across extensive experiments on various model configurations, we demonstrate that SSRM significantly improves reward models without incurring additional labeling costs. Notably, SSRM can achieve performance comparable to models trained entirely on labeled data of equivalent volumes. Overall, SSRM substantially reduces the dependency on large volumes of human-annotated data, thereby decreasing the overall cost and time involved in training effective reward models.
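The three iterative steps described above (pseudo-labeling, confidence-based selection, supervised finetuning) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a linear reward over small feature vectors in place of an LLM-based reward model, a Bradley-Terry preference probability, and a refined training set formed by adding confident pseudo-labeled pairs to the labeled pairs. All names (`ssrm`, `finetune`, `pref_prob`, `true_w`) and the values of `threshold`, `rounds`, `lr`, and `epochs` are hypothetical.

```python
import math
import random


def reward(w, x):
    """Toy linear reward; stands in for an LLM-based reward model (assumption)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def pref_prob(w, a, b):
    """Bradley-Terry probability that response a is preferred over response b."""
    return sigmoid(reward(w, a) - reward(w, b))


def finetune(w, pairs, lr=0.5, epochs=200):
    """Gradient descent on the pairwise logistic loss; each pair is (chosen, rejected)."""
    w = list(w)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for chosen, rejected in pairs:
            p = pref_prob(w, chosen, rejected)
            for i in range(len(w)):
                # d/dw of -log(p) is (p - 1) * (chosen - rejected)
                grad[i] += (p - 1.0) * (chosen[i] - rejected[i])
        w = [wi - lr * g / len(pairs) for wi, g in zip(w, grad)]
    return w


def ssrm(w, labeled, unlabeled, threshold=0.8, rounds=3):
    """Sketch of the loop: pseudo-label, filter by confidence, finetune, repeat."""
    w = finetune(w, labeled)
    train, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        confident, remaining = [], []
        for a, b in pool:
            p = pref_prob(w, a, b)
            if max(p, 1.0 - p) >= threshold:
                # Pseudo-label: order the pair as (chosen, rejected)
                confident.append((a, b) if p >= 0.5 else (b, a))
            else:
                remaining.append((a, b))
        if not confident:
            break
        train, pool = train + confident, remaining
        w = finetune(w, train)
    return w, train


# Toy data: preferences generated by a hidden linear rule (purely illustrative).
rng = random.Random(0)
true_w = [1.0, -2.0, 0.5]


def random_pair():
    return ([rng.uniform(-1, 1) for _ in range(3)],
            [rng.uniform(-1, 1) for _ in range(3)])


def label(pair):
    a, b = pair
    return (a, b) if reward(true_w, a) >= reward(true_w, b) else (b, a)


labeled = [label(random_pair()) for _ in range(20)]
unlabeled = [random_pair() for _ in range(200)]
w, train = ssrm([0.0, 0.0, 0.0], labeled, unlabeled)
print(len(train) - len(labeled), "pseudo-labeled pairs added")
```

In this sketch the threshold trades off how many pseudo-labeled pairs are admitted against how noisy they are: a higher value keeps fewer but more reliable pairs per round.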

Yifei He, Haoxiang Wang, Ziyan Jiang, Alexandros Papangelis, Han Zhao • 2024

Related benchmarks

Task	Dataset	Result
AlpacaEval 2.0	UltraFeedback	LC 16.2	42
MT-Bench	UltraFeedback	MT-Bench Score 6.3	42
AlpacaEval 2.0	UltraMedical Preference	LC 13.1	28
MT-Bench	DSP Business	MT Score 6.8	28
AlpacaEval 2.0	DSP Business	LC 15.9	28
MT-Bench	UltraMedical Preference	MT Score 6.4	28
