
Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training

About

Standard reward models typically predict scalar scores that fail to capture the multifaceted nature of response quality in non-verifiable domains, such as creative writing or open-ended instruction following. To address this limitation, we propose Rubric-ARM, a framework that jointly optimizes a rubric generator and a judge using reinforcement learning from preference feedback. Unlike existing methods that rely on static rubrics or disjoint training pipelines, our approach treats rubric generation as a latent action learned to maximize judgment accuracy. We introduce an alternating optimization strategy to mitigate the non-stationarity of simultaneous updates, and provide a theoretical analysis showing how this schedule reduces gradient variance during training. Extensive experiments show that Rubric-ARM achieves the best performance among the compared baselines on multiple benchmarks and significantly improves downstream policy alignment in both offline and online reinforcement learning settings.

Ran Xu, Tianci Liu, Zihan Dong, Tony Yu, Ilgee Hong, Carl Yang, Linjun Zhang, Tao Zhao, Haoyu Wang • 2026
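
The alternating schedule mentioned in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm: the two scalar parameters stand in for the rubric generator and the judge, `toy_objective` is a hypothetical concave surrogate for judgment accuracy, and the phase order, step counts, and learning rate are all assumptions chosen for illustration. The only point it shows is the schedule itself: one component is frozen while the other is updated, then the roles swap.

```python
from typing import Tuple


def toy_objective(rubric_param: float, judge_param: float) -> float:
    """Hypothetical stand-in for judgment accuracy; maximized at (1.0, 2.0)."""
    return (
        -(rubric_param - 1.0) ** 2
        - (judge_param - 2.0) ** 2
        - 0.5 * (judge_param - rubric_param - 1.0) ** 2
    )


def grad_rubric(rubric_param: float, judge_param: float) -> float:
    """Partial derivative of toy_objective with respect to rubric_param."""
    return -2.0 * (rubric_param - 1.0) + (judge_param - rubric_param - 1.0)


def grad_judge(rubric_param: float, judge_param: float) -> float:
    """Partial derivative of toy_objective with respect to judge_param."""
    return -2.0 * (judge_param - 2.0) - (judge_param - rubric_param - 1.0)


def alternating_ascent(
    rounds: int = 50, steps_per_phase: int = 5, lr: float = 0.1
) -> Tuple[float, float]:
    """Alternate gradient-ascent phases, freezing one parameter at a time."""
    rubric_param, judge_param = 0.0, 0.0
    for _ in range(rounds):
        # Phase 1 (assumed order): rubric parameter frozen, judge updated.
        for _ in range(steps_per_phase):
            judge_param += lr * grad_judge(rubric_param, judge_param)
        # Phase 2: judge parameter frozen, rubric parameter updated.
        for _ in range(steps_per_phase):
            rubric_param += lr * grad_rubric(rubric_param, judge_param)
    return rubric_param, judge_param


if __name__ == "__main__":
    r, j = alternating_ascent()
    print(f"rubric_param={r:.4f}, judge_param={j:.4f}")
```

Because each phase sees a fixed partner, the target each component optimizes does not move within a phase, which is the intuition behind using alternation to reduce non-stationarity. In the actual framework the updates would be reinforcement learning steps on language models rather than analytic gradients on scalars.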

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Instruction Following | IFEval | Accuracy (IFEval) | 80.4 | 89 |
| Reward Modeling | RewardBench Focus 2 | Accuracy | 90.3 | 82 |
| Reward Modeling | RewardBench Precise IF 2 | Accuracy | 46.2 | 70 |
| Reward Modeling | HelpSteer 3 | Accuracy | 71.1 | 62 |
| Reward Modeling | RM-Bench Chat | Accuracy | 69.2 | 42 |
| Reward Modeling | RewardBench Chat | Accuracy | 90.3 | 42 |
| Reward Modeling | RM-Bench Chat Hard | Accuracy | 80.7 | 34 |
| Reward Model Evaluation | RewardBench 2 | Factuality | 48.8 | 21 |
| Reward Modeling | PPE-IFEval | Accuracy | 0.72 | 18 |
| Instruction Following | IFBench | Accuracy | 35 | 18 |
Showing 10 of 16 rows
