| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Reward Modeling | RM-Bench | Average Score 89.4 | 53 |
| Reward Modeling | RM Bench Code | EF 0.154 | 52 |
| Reward Modeling | RM-Bench (test) | Overall Score 87.1 | 39 |
| Reward Modeling | RM-Bench Chat Hard | Accuracy 83.3 | 34 |
| Reward Modeling Suitability Evaluation | RM Bench Math | EF -0.077 | 26 |
| Reward Modeling Suitability Evaluation | RM Bench Safety-accept | EF 0.698 | 26 |
| Reward Model Suitability Audit | RM Bench Chat | EF 0.313 | 26 |
| Reward Modeling | RM-Bench Chat | Accuracy 78.5 | 18 |
| Reward Modeling | RM-Bench Chat subset Normal | Accuracy 86 | 16 |
| Reward Modeling | RM-Bench (full) | Chat Score 83 | 11 |
| Reward Modeling | RM-Bench Hard | Accuracy 0.697 | 10 |
| Reward Modeling | RM-Bench Normal | Accuracy 80 | 10 |
| Reward Modeling | RM-Bench Easy | Accuracy 92.2 | 10 |
| Reward Modeling | RM-Bench v1.0 (test) | Chat Score 71.23 | 5 |