RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains
About
Pointwise reward modeling offers critical signals for LLM post-training, yet struggles with absolute scoring in subjective, non-verifiable settings. Rubric-based methods address this by decomposing evaluation into explicit criteria, but existing approaches typically depend on frontier LLMs and suffer from ties caused by hard Boolean aggregation. We present RUBRIC-ARROW, an alternating framework that jointly trains a rubric generator and a rubric-conditioned judge, with its RL stage using only pairwise preference data. Our method combines a probability-based scoring rule that reduces ties, phase-specific preference-based rewards, and an alternating GRPO scheme; together, these train the pointwise evaluator. Extensive experiments show that RUBRIC-ARROW achieves competitive reward-modeling accuracy and yields consistent gains for downstream policy post-training.
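To illustrate why hard Boolean aggregation produces ties and how a probability-based score can break them, here is a minimal sketch. It is not the paper's scoring rule, which the abstract does not specify. It assumes the judge emits a probability that each rubric criterion is satisfied and that the score is a plain average. The function names, the 0.5 threshold, the example probabilities, and the 0/1 preference reward are all hypothetical.

```python
from typing import Optional, Sequence


def boolean_score(probs: Sequence[float], threshold: float = 0.5) -> float:
    """Hard aggregation: fraction of criteria judged satisfied (assumed threshold)."""
    return sum(p >= threshold for p in probs) / len(probs)


def probability_score(probs: Sequence[float]) -> float:
    """Soft aggregation: mean of per-criterion probabilities (assumed form)."""
    return sum(probs) / len(probs)


def prefers_first(score_a: float, score_b: float) -> Optional[bool]:
    """Return True/False for a strict preference, or None on a tie."""
    if score_a == score_b:
        return None
    return score_a > score_b


def preference_reward(score_chosen: float, score_rejected: float) -> float:
    """Hypothetical pairwise reward: 1.0 when the pointwise scores rank
    the human-preferred response strictly higher, else 0.0."""
    return 1.0 if score_chosen > score_rejected else 0.0


# Hypothetical per-criterion probabilities for two responses on a 3-item rubric.
resp_a = [0.9, 0.8, 0.3]
resp_b = [0.6, 0.55, 0.4]

# Both responses pass exactly two criteria, so the Boolean scores tie.
tie = prefers_first(boolean_score(resp_a), boolean_score(resp_b))

# The soft scores differ, so the pair can be ranked.
ranked = prefers_first(probability_score(resp_a), probability_score(resp_b))
```

In this toy case the Boolean scores cannot separate the two responses, while the averaged probabilities can, so a pairwise preference label can yield a non-degenerate reward for a pointwise scorer.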
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Instruction Following | AlpacaEval | -- | -- | 420 |
| Instruction Following | FollowBench | -- | -- | 85 |
| Reward Modeling | HelpSteer 3 | Accuracy | 72 | 62 |
| Instruction Following | IFEval | Avg. Score (IFEval) | 80.7 | 45 |
| Reward Modeling | RewardBench Chat | Accuracy | 90.8 | 42 |
| Reward Modeling | RM-Bench Chat | Accuracy | 68.6 | 42 |
| Reward Modeling | RewardBench 2 | Precise IF Score | 45 | 41 |
| Instruction Following | AlpacaEval Length-controlled | Score | 45.7 | 34 |
| Instruction Following Evaluation | PPE-IFEval | Score | 76 | 24 |
| Instruction Following Evaluation | IFBench | Score | 73.2 | 23 |