CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria

About

Automatic evaluation is crucial yet challenging for open-ended natural language generation, especially when rule-based metrics are infeasible. Compared with traditional methods, the recent LLM-as-a-Judge paradigm enables better and more flexible evaluation, and shows promise as a source of generative reward models for reinforcement learning (RL). However, prior work has revealed a notable gap between the seemingly impressive benchmark performance of such models and their actual effectiveness in RL practice. We attribute this gap to limitations in existing studies, including the dominance of pairwise evaluation and inadequate optimization of evaluation criteria. Therefore, we propose CE-RM-4B, a pointwise generative reward model that is trained with a dedicated two-stage rollout method and adopts unified query-based criteria. Using only about 5.7K high-quality samples curated from an open-source preference dataset, CE-RM-4B achieves superior performance on diverse reward model benchmarks, especially in Best-of-N scenarios, and delivers more effective improvements in downstream RL practice.
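
To illustrate why pointwise scoring fits Best-of-N selection, here is a minimal sketch, not the authors' implementation. It assumes a scorer that rates each response independently given the query, so choosing among N candidates takes N scoring calls and no pairwise comparisons. The names best_of_n and toy_score are hypothetical, and toy_score is a word-overlap stand-in for a real reward model call.

```python
from typing import Callable, Sequence


def best_of_n(
    query: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
) -> str:
    """Return the candidate with the highest pointwise score.

    Each candidate is scored on its own, so N candidates need N calls.
    On ties, the earliest candidate is returned.
    """
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return max(candidates, key=lambda response: score(query, response))


def toy_score(query: str, response: str) -> float:
    """Hypothetical stand-in scorer: counts response words found in the query.

    A real pointwise generative reward model would instead produce a
    judgment and a score for the (query, response) pair.
    """
    query_words = set(query.lower().split())
    return float(sum(word in query_words for word in response.lower().split()))


if __name__ == "__main__":
    picked = best_of_n(
        "what is a reward model",
        ["a reward model scores responses", "bananas are yellow"],
        toy_score,
    )
    print(picked)
```

A pairwise judge, by contrast, would need some tournament or round-robin scheme over candidate pairs to pick a winner, which is one reason pointwise scores are convenient both for Best-of-N and as scalar rewards in RL.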

Xinyu Hu, Yancheng He, Weixun Wang, Tao Feng, Li Lin, Jiashun Liu, Wenbo Su, Bo Zheng, Xiaojun Wan • 2026

Related benchmarks

Task	Dataset	Result
Reward Modeling	RM-Bench (test)	Overall Score: 83.2	63
Reward Modeling	JudgeBench (test)	Overall: 76.3	40
Reward Modeling	PPE Correctness (test)	PPE Corr: 75	26
Reward Modeling	RewardBench (test)	RWBench: 0.9	25
Creative Writing	Arena-Hard Creative Writing v2	Score: 49.1	25
LLM Evaluation	Arena-Hard v2	Score: 18.2	14
LLM Evaluation	Arena Hard v0.1	Arena-Hard Score: 78.3	9
Reward Modeling	RewardBench 2 (test)	RWBench2 Score: 76.3	9
Reward Modeling	Overall Performance (test)	Overall: 80.2	9
