| RewardBench | | Avg Score 95.1 | | 118 | 4d ago |
| RewardBench Focus 2 | Rubric-ARM-voting@5 | Accuracy 90.3 | | 82 | 4d ago |
| RewardBench Precise IF 2 | | Accuracy 57.5 | | 70 | 4d ago |
| RewardBench | | Accuracy 95.1 | | 70 | 4d ago |
| RM-Bench | OpenRS | Average Score 89.4 | | 53 | 4d ago |
| RewardBench Average 2 | FLIP | Accuracy 39.7 | | 52 | 4d ago |
| RewardBench Math 2 | Pointwise Rating | Accuracy 35.7 | | 52 | 4d ago |
| RM Bench Code | Skywork-Reward-Gemma-2-27B | EF 0.154 | | 52 | 4d ago |
| Reward Bench Math | internlm2-20b-reward | EF 0.305 | | 52 | 4d ago |
| Aggregate of 7 benchmarks (HelpSteer3, Reward Bench V2, SCAN-HPD, HREF, LitBench, WQ_Arena, WPB) | | Overall Accuracy 74.56 | | 45 | 4d ago |
| JudgeBench | OpenRS | Accuracy 93.3 | | 45 | 4d ago |
| Unified Feedback (UF) | GRM-SFT | Accuracy 78.9 | | 40 | 4d ago |
| JudgeBench (test) | Qwen3-30B-A3B | Overall 82 | | 40 | 4d ago |
| RM-Bench (test) | Qwen3-30B-A3B | Overall Score 87.1 | | 39 | 4d ago |
| HelpSteer 3 | | Accuracy 83.15 | | 39 | 4d ago |
| RM-Bench Chat Hard | | Accuracy 83.3 | | 34 | 4d ago |
| PPE Correctness | OPRM-RgFT-Qwen2.5-32B | Accuracy 67.3 | | 33 | 4d ago |
| Meta-World Open door | LTCNtext | Prediction Accuracy 65.46 | | 28 | 4d ago |
| Meta-World Open drawer | LTriplet | Prediction Accuracy 69.01 | | 28 | 4d ago |
| Meta-World Button press | LTriplet | Prediction Accuracy 76.44 | | 28 | 4d ago |
| RewardBench v1.0 (test) | BT + margin | Chat Score 0.9777 | | 27 | 4d ago |
| Reward Bench safety subset response perturbations 2 | Llama3-8B-IDRM | LE Score -0.629 | | 26 | 4d ago |
| Reward Bench safety subset prompt perturbations 2 | Llama-3-OffsetBias-RM-8B | EF -0.18 | | 26 | 4d ago |
| PPE Correctness (test) | CE-RM-4B | PPE Corr 75 | | 26 | 4d ago |
| RewardBench (test) | J1-Llama-70B | RWBench 0.933 | | 25 | 4d ago |