
Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models

About

Recent advances in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably enhances the reliability of evaluation. However, current work predominantly relies on unstructured length scaling, ignoring the divergent efficacy of different reasoning mechanisms: Breadth-CoT (B-CoT, i.e., multi-dimensional principle coverage) and Depth-CoT (D-CoT, i.e., substantive judgment soundness). To address this, we introduce Mix-GRM, a framework that reconfigures raw rationales into structured B-CoT and D-CoT through a modular synthesis pipeline, then employs Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to internalize and optimize these mechanisms. Comprehensive experiments demonstrate that Mix-GRM establishes a new state of the art across five benchmarks, surpassing leading open-source RMs by an average of 8.2%. Our results reveal a clear divergence in reasoning: B-CoT benefits subjective preference tasks, whereas D-CoT excels at objective correctness tasks; consequently, misaligning the reasoning mechanism with the task directly degrades performance. Furthermore, we show that RLVR acts as a switching amplifier, inducing an emergent polarization in which the model spontaneously allocates its reasoning style to match task demands. The synthesized data and models are released on Hugging Face (https://huggingface.co/collections/DonJoey/mix-grm), and the code is available on GitHub (https://github.com/Don-Joey/Mix-GRM).
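To make the RLVR step concrete: the verifiable reward for a generative reward model is typically binary, granted only when the verdict parsed from the model's free-form rationale matches the gold preference label. The sketch below is illustrative, not the paper's implementation; the `[[A]]`/`[[B]]` verdict tags and all function names are assumptions.

```python
import re

def parse_verdict(rationale: str):
    """Extract the final verdict tag from a rationale, e.g. '[[A]]' or '[[B]]'.

    Returns 'A' or 'B', or None if no verdict tag is present. Taking the
    last match tolerates tags mentioned earlier in the reasoning.
    """
    matches = re.findall(r"\[\[([AB])\]\]", rationale)
    return matches[-1] if matches else None

def verifiable_reward(rationale: str, gold_label: str) -> float:
    """Binary RLVR-style reward: 1.0 iff the parsed verdict equals the gold label."""
    return 1.0 if parse_verdict(rationale) == gold_label else 0.0
```

A rule-based check like this is what makes the reward "verifiable": it needs no learned judge, so the RL signal cannot be gamed by persuasive but wrong rationales, only by correct final verdicts.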

Qiyuan Zhang, Yufei Wang, Tianhe Wu, Can Xu, Qingfeng Sun, Kai Zheng, Xue Liu, Chen Ma • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Reward Modeling | RM-Bench | Accuracy | 82.7 | 125
Reward Modeling | RMB | Accuracy | 80.1 | 120
Reward Modeling | RewardBench v2 | Accuracy | 77.5 | 72
Reward Modeling | RewardBench v1 | Accuracy | 91.8 | 28
Reward Modeling | PPE | Accuracy | 64.8 | 13
Instruction Following | Alpaca-V2 / Arena-Hard | Alpaca V2 Score | 9.2 | 6
Mathematical Reasoning | GSM8K, MATH, STEM, TABMWP | GSM8K Accuracy | 77.6 | 6
