
Generative Reward Models

About

Reinforcement Learning from Human Feedback (RLHF) has greatly improved the performance of modern Large Language Models (LLMs). The RLHF process is resource-intensive and technically challenging, generally requiring a large collection of human preference labels over model-generated outputs. Reinforcement Learning from AI Feedback (RLAIF) addresses this data collection challenge by leveraging synthetic preferences generated by an LLM. However, recent work has shown that synthetic preference labels may not align well with human preference judgments. To address this, we propose a hybrid approach that unifies RLHF and RLAIF methodologies. We introduce GenRM, an iterative algorithm that trains an LLM on self-generated reasoning traces, leading to synthetic preference labels that match human preference judgments. Empirically, we show that zero-shot LLM-based judgments underperform compared to Bradley-Terry reward models on in-distribution tasks (by 9-36%). In contrast, GenRM achieves in-distribution accuracy comparable to Bradley-Terry models, while significantly outperforming them on out-of-distribution tasks (by 10-45%). Moreover, GenRM surpasses the performance of using LLMs as judges on both in-distribution (by 9-31%) and out-of-distribution tasks (by 2-6%). Our results show that combining the strengths of RLHF and RLAIF offers a promising approach for improving the quality of synthetic preference labels.
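For context on the baseline the abstract compares against: a Bradley-Terry reward model is trained to score a human-preferred ("chosen") response above a dispreferred ("rejected") one, modeling P(chosen > rejected) = sigmoid(r_chosen - r_rejected). The sketch below shows only this standard pairwise loss, not the paper's GenRM training procedure; function names are illustrative.

```python
import math

def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the observed preference under the
    Bradley-Terry model: P(chosen > rejected) = sigmoid(margin),
    where margin = reward_chosen - reward_rejected."""
    margin = reward_chosen - reward_rejected
    prob_chosen_wins = 1.0 / (1.0 + math.exp(-margin))
    return -math.log(prob_chosen_wins)

# A reward model that ranks the chosen response higher incurs a small
# loss; reversing the ranking makes the loss grow.
correct = bradley_terry_loss(2.0, 0.5)   # correct ranking, small loss
reversed_ = bradley_terry_loss(0.5, 2.0) # wrong ranking, larger loss
```

Minimizing this loss over a dataset of preference pairs is what yields the Bradley-Terry reward models used as the in-distribution baseline; GenRM instead has the LLM generate a reasoning trace before emitting its preference label.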

Dakota Mahan, Duy Van Phung, Rafael Rafailov, Chase Blagden, Nathan Lile, Louis Castricato, Jan-Philipp Fränken, Chelsea Finn, Alon Albalak • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Web-based Agent Interaction | WebShop (test) | Success Rate | 23.81 | 42 |
| Web-based Agent Interaction | WebShop (val) | Success Rate | 30.94 | 31 |
| Agent Interaction | Aggregate FTWP, ScienceWorld, WebShop | Format Faithfulness Rate | 88.9 | 17 |
| Agent Interaction | FTWP (val) | Success Rate | 37.77 | 17 |
| Agent Interaction | ScienceWorld (val) | Success Rate | 42.12 | 17 |
| Agent Interaction | FTWP (test) | Success Rate | 30.5 | 17 |
| Agent Interaction | ScienceWorld (test) | Success Rate | 33.85 | 17 |
