Semi-Supervised Reward Modeling via Iterative Self-Training

About

Reward models (RM) capture the values and preferences of humans and play a central role in Reinforcement Learning with Human Feedback (RLHF) to align pretrained large language models (LLMs). Traditionally, training these models relies on extensive human-annotated preference data, which poses significant challenges in terms of scalability and cost. To overcome these limitations, we propose Semi-Supervised Reward Modeling (SSRM), an approach that enhances RM training using unlabeled data. Given an unlabeled dataset, SSRM involves three key iterative steps: pseudo-labeling unlabeled examples, selecting high-confidence examples through a confidence threshold, and supervised finetuning on the refined dataset. Across extensive experiments on various model configurations, we demonstrate that SSRM significantly improves reward models without incurring additional labeling costs. Notably, SSRM can achieve performance comparable to models trained entirely on labeled data of equivalent volumes. Overall, SSRM substantially reduces the dependency on large volumes of human-annotated data, thereby decreasing the overall cost and time involved in training effective reward models.
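The three iterative steps described above (pseudo-labeling, confidence filtering, supervised finetuning) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy linear reward model over hand-made feature vectors with a Bradley-Terry preference likelihood, whereas the paper works with LLM-based reward models. The function names (`pseudo_label`, `finetune`, `ssrm`), the threshold of 0.9, the number of rounds, and the choice to accumulate selected pairs into the training set are all illustrative assumptions.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def reward(w, x):
    # Toy linear reward; a real RM would be an LLM with a scalar head.
    return sum(wi * xi for wi, xi in zip(w, x))

def prefer_prob(w, a, b):
    # Bradley-Terry probability that response a is preferred over b.
    return sigmoid(reward(w, a) - reward(w, b))

def pseudo_label(w, pairs, threshold):
    # Steps 1 and 2: label each unlabeled pair with the current model and
    # keep only pairs whose predicted confidence reaches the threshold.
    kept, rest = [], []
    for a, b in pairs:
        p = prefer_prob(w, a, b)
        if max(p, 1.0 - p) >= threshold:
            kept.append((a, b) if p >= 0.5 else (b, a))  # (chosen, rejected)
        else:
            rest.append((a, b))
    return kept, rest

def finetune(w, data, lr=0.1, epochs=50):
    # Step 3: gradient descent on -log sigmoid(r(chosen) - r(rejected)).
    w = list(w)
    for _ in range(epochs):
        for chosen, rejected in data:
            g = 1.0 - prefer_prob(w, chosen, rejected)
            for i in range(len(w)):
                w[i] += lr * g * (chosen[i] - rejected[i])
    return w

def ssrm(labeled, unlabeled, dim, threshold=0.9, rounds=3):
    # Train on labeled pairs first, then iterate the three steps.
    w = finetune([0.0] * dim, labeled)
    train, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        kept, pool = pseudo_label(w, pool, threshold)
        if not kept:
            break  # nothing confident enough remains in the pool
        train += kept
        w = finetune(w, train)
    return w, train

# Hypothetical toy data: one labeled pair, three unlabeled pairs.
labeled = [((1.0, 0.0), (0.0, 1.0))]
unlabeled = [((2.0, 0.0), (0.0, 2.0)),
             ((0.0, 3.0), (3.0, 0.0)),
             ((1.0, 1.0), (1.0, 1.0))]
w, train = ssrm(labeled, unlabeled, dim=2)
```

In this sketch the pair with identical responses always receives a preference probability of 0.5, so it never passes the confidence filter and stays out of the training set. This is the kind of low-confidence example the thresholding step is meant to exclude.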

Yifei He, Haoxiang Wang, Ziyan Jiang, Alexandros Papangelis, Han Zhao • 2024

Related benchmarks

Task | Dataset | Result | Rank
AlpacaEval 2.0 | UltraFeedback | LC 16.2 | 42
MT-Bench | UltraFeedback | MT-Bench Score 6.3 | 42
AlpacaEval 2.0 | UltraMedical Preference | LC 13.1 | 28
MT-Bench | DSP Business | MT Score 6.8 | 28
AlpacaEval 2.0 | DSP Business | LC 15.9 | 28
MT-Bench | UltraMedical Preference | MT Score 6.4 | 28