ConsistRM: Improving Generative Reward Models via Consistency-Aware Self-Training

About

Generative reward models (GRMs) have emerged as a promising approach for aligning Large Language Models (LLMs) with human preferences, offering greater representational capacity and flexibility than traditional scalar reward models. However, GRMs face two major challenges: reliance on costly human-annotated data restricts scalability, and self-training approaches often suffer from instability and vulnerability to reward hacking. To address these issues, we propose ConsistRM, a self-training framework that enables effective and stable GRM training without human annotations. ConsistRM incorporates the Consistency-Aware Answer Reward, which produces reliable pseudo-labels with temporal consistency and thereby stabilizes model optimization. In addition, the Consistency-Aware Critique Reward assesses semantic consistency across multiple critiques and allocates fine-grained, differentiated rewards. Experiments on five benchmark datasets across four base models demonstrate that ConsistRM outperforms vanilla Reinforcement Fine-Tuning (RFT) by an average of 1.5%. Further analysis shows that ConsistRM enhances output consistency and mitigates position bias caused by input order, highlighting the effectiveness of consistency-aware rewards in improving GRMs.
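The abstract mentions position bias caused by input order but does not say how it is measured. One common way to quantify it is to query a pairwise judge twice with the two responses swapped and count how often the same underlying response wins. The sketch below illustrates that idea only; the `Judge` signature, `position_consistency`, and both toy judges are hypothetical names introduced here and are not taken from ConsistRM.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge signature (an assumption for illustration):
# (prompt, first_response, second_response) -> "A" if the first response
# is preferred, "B" if the second one is preferred.
Judge = Callable[[str, str, str], str]


def position_consistency(judge: Judge,
                         pairs: Sequence[Tuple[str, str, str]]) -> float:
    """Fraction of pairs whose winner is unchanged when the order is swapped."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    consistent = 0
    for prompt, r1, r2 in pairs:
        forward = judge(prompt, r1, r2)   # r1 shown first
        backward = judge(prompt, r2, r1)  # r2 shown first
        # Map each verdict back to the index of the underlying response.
        winner_forward = 0 if forward == "A" else 1
        winner_backward = 1 if backward == "A" else 0
        if winner_forward == winner_backward:
            consistent += 1
    return consistent / len(pairs)


def length_judge(prompt: str, first: str, second: str) -> str:
    # Toy judge: prefers the longer response, regardless of position.
    return "A" if len(first) >= len(second) else "B"


def first_slot_judge(prompt: str, first: str, second: str) -> str:
    # Toy judge with maximal position bias: always picks the first slot.
    return "A"


pairs = [("q1", "short", "a much longer answer"), ("q2", "xx", "y")]
unbiased_score = position_consistency(length_judge, pairs)
biased_score = position_consistency(first_slot_judge, pairs)
```

A judge that always favors the first slot scores 0 on this measure, while an order-invariant judge scores 1; real models typically fall somewhere in between.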

Yu Liang, Liangxin Liu, Longzheng Wang, Yan Wang, Yueyang Zhang, Long Xia, Zhiyuan Sun, Daiting Shi • 2026

Related benchmarks

Task	Dataset	Metric	Result
Reward Modeling	RewardBench	Accuracy	85.6	166
Reward Modeling	RM-Bench	Accuracy	78.3	125
Reward Modeling	RMB	Accuracy	79.1	120
Reward Modeling	JudgeBench	Accuracy	56.9	105
Reward Modeling	PPE Pref	Accuracy	67.7	15
Reward Modeling	Overall 5-Benchmark Suite	Average Score	73.5	12
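
The overall score in the last row is consistent with an unweighted mean of the five benchmark accuracies listed above. The page does not state how the average is computed, so the equal weighting below is an assumption; the check is a small sketch, and `mean_accuracy` is a helper name introduced here.

```python
# Accuracy values copied from the table above.
scores = {
    "RewardBench": 85.6,
    "RM-Bench": 78.3,
    "RMB": 79.1,
    "JudgeBench": 56.9,
    "PPE Pref": 67.7,
}


def mean_accuracy(results: dict) -> float:
    """Unweighted mean over benchmarks (assumed; the weighting is not stated)."""
    return sum(results.values()) / len(results)


overall = mean_accuracy(scores)
print(f"{overall:.2f}")  # 367.6 / 5 = 73.52, which rounds to the listed 73.5
```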
