Critique-out-Loud Reward Models

About

Traditionally, reward models used for reinforcement learning from human feedback (RLHF) are trained to directly predict preference scores without leveraging the generation capabilities of the underlying large language model (LLM). This limits the capabilities of reward models as they must reason implicitly about the quality of a response, i.e., preference modeling must be performed in a single forward pass through the model. To enable reward models to reason explicitly about the quality of a response, we introduce Critique-out-Loud (CLoud) reward models. CLoud reward models operate by first generating a natural language critique of the assistant's response that is then used to predict a scalar reward for the quality of the response. We demonstrate the success of CLoud reward models for both Llama-3-8B and 70B base models: compared to classic reward models CLoud reward models improve pairwise preference classification accuracy on RewardBench by 4.65 and 5.84 percentage points for the 8B and 70B base models respectively. Furthermore, CLoud reward models lead to a Pareto improvement for win rate on ArenaHard when used as the scoring model for Best-of-N. Finally, we explore how to exploit the dynamic inference compute capabilities of CLoud reward models by performing self-consistency decoding for reward prediction.

Zachary Ankner, Mansheej Paul, Brandon Cui, Jonathan D. Chang, Prithviraj Ammanabrolu• 2024

Related benchmarks

Task	Dataset	Result
Reward Modeling	RewardBench	Chat Score93.6	216
Reward Modeling	RewardBench	Accuracy82	166
Reward Modeling	RM-Bench	--	137
Reward Modeling	RMB	Accuracy63.4	120
Reward Modeling	RewardBench v1.0 (test)	Average Score0.759	89
Reward Modeling	PPE Correctness	Accuracy62.4	45
Reward Modeling	Aggregate of 7 benchmarks (HelpSteer3, Reward Bench V2, SCAN-HPD, HREF, LitBench, WQ_Arena, WPB)	Overall Accuracy68.7	45
Best-of-N Reranking	Average of 7 benchmarks (including AIME24, LeetCode) (test)	Average Accuracy42.5	42
LLM-as-a-judge evaluation	MT-Bench	Pearson's r0.511	36
LLM-as-a-judge evaluation	FB Bench (Feedback Bench)	Pearson's r0.381	36

Showing 10 of 13 rows

Other info

Follow for update

@wizwand_team Discord