
R-Align: Enhancing Generative Reward Models through Rationale-Centric Meta-Judging

About

Reinforcement Learning from Human Feedback (RLHF) remains indispensable for aligning large language models (LLMs) in subjective domains. To improve robustness, recent work has shifted toward Generative Reward Models (GenRMs), which generate rationales before predicting preferences. Yet GenRM training and evaluation still supervise only the outcome label, leaving reasoning quality unchecked. We show that reasoning fidelity, i.e., the consistency between a GenRM's preference decision and the reference decision rationale, is highly predictive of downstream RLHF outcomes, beyond standard label accuracy. Specifically, we repurpose existing reward-model benchmarks to compute Spurious Correctness (S-Corr): the fraction of label-correct decisions whose rationales are misaligned with the golden judgments. Our empirical evaluation reveals substantial S-Corr even for competitive GenRMs, and higher S-Corr is associated with policy degeneration under optimization. To improve fidelity, we propose Rationale-Centric Alignment (R-Align), which augments training with gold judgments and explicitly supervises rationale alignment. R-Align reduces S-Corr on RM benchmarks and yields consistent gains in actor performance across STEM, coding, instruction-following, and general tasks.
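The S-Corr metric described above can be sketched in a few lines. This is a minimal illustration, assuming each evaluated decision has been annotated with two booleans: whether the preference label matches the gold label, and whether the rationale aligns with the golden judgment (the field names here are hypothetical, not from the paper):

```python
# Sketch of Spurious Correctness (S-Corr): the fraction of label-correct
# decisions whose rationale is misaligned with the golden judgment.
# Record field names are illustrative assumptions.

def s_corr(records):
    """records: iterable of dicts with boolean keys
    'label_correct'     -- decision matches the gold preference label
    'rationale_aligned' -- rationale is consistent with the gold judgment
    """
    correct = [r for r in records if r["label_correct"]]
    if not correct:
        return 0.0
    spurious = sum(1 for r in correct if not r["rationale_aligned"])
    return spurious / len(correct)

# Example: 4 label-correct decisions, 1 of them with a misaligned rationale.
records = [
    {"label_correct": True,  "rationale_aligned": True},
    {"label_correct": True,  "rationale_aligned": False},
    {"label_correct": True,  "rationale_aligned": True},
    {"label_correct": True,  "rationale_aligned": True},
    {"label_correct": False, "rationale_aligned": False},
]
print(s_corr(records))  # 0.25
```

Note that label-incorrect decisions do not enter the denominator: S-Corr isolates decisions that look right by label accuracy alone but rest on flawed reasoning.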

Yanlin Lai, Mitt Huang, Hangyu Guo, Xiangfeng Wang, Haodong Li, Shaoxiong Zhan, Liang Zhao, Chengyuan Yao, Yinmin Zhang, Qi Han, Chun Yuan, Zheng Ge, Xiangyu Zhang, Daxin Jiang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Generation | LiveCodeBench | Average Score | 51.5 | 68 |
| Reward Modeling | HelpSteer 3 | -- | -- | 39 |
| General Instruction Following | Arena-Hard v2 | Score | 60.2 | 23 |
| Reward Modeling | RewardBench 2 | L-Acc | 92 | 20 |
| Reward Modeling | PPE-Preference | Accuracy | 65.7 | 20 |
| General Instruction Following | WildBench | Score | 92.6 | 19 |
| Overall Language Model Evaluation | Aggregated Benchmarks (STEM, Code, IF, General) | Average Score | 61.7 | 7 |
| General-purpose Behavior | MultiChallenge | Score | 55.7 | 7 |
| STEM Reasoning | AIME 2025 | Score | 67.2 | 7 |
| STEM Reasoning | GPQA | Score | 60.3 | 7 |

(10 of 11 rows shown.)
