| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Reward Modeling Evaluation | Reward Bench Factuality 2 | Pairwise Accuracy | 56.6 | 64 |
| Reward Modeling | Reward Bench Math | EF | 0.305 | 52 |
| Reward Modeling | Reward Bench safety subset response perturbations 2 | LE Score | -0.629 | 26 |
| Reward Modeling | Reward Bench safety subset prompt perturbations 2 | EF | -0.18 | 26 |
| Reward Modeling | Reward Bench V2 | Accuracy | 83.44 | 22 |
| Reward Modeling | Reward Bench Prior Sets | Prior Sets Score | 78.2 | 17 |
| Reward Model Evaluation | Reward Bench 2 (test) | RB2 Factuality MAE | 0.451 | 12 |
| Reward Modeling Evaluation | Reward Bench Ties 2 | Pairwise Accuracy | 91.8 | 12 |
| Reward Modeling Evaluation | Reward Bench Safety 2 | Pairwise Accuracy | 72.3 | 12 |
| Reward Modeling Evaluation | Reward Bench Math 2 | Pairwise Accuracy | 72.3 | 12 |
| Reward Modeling Evaluation | Reward-Bench | Agreement | 84.79 | 12 |