| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Reward Modeling | RM-Bench | Accuracy 96 | 125 |
| Reward Modeling | RM-Bench (test) | Overall Score 96 | 63 |
| Reward Modeling Evaluation | RM-Bench | Chat Score 75.6 | 55 |
| Reward Modeling | RM Bench Code | EF 0.154 | 52 |
| Reward Modeling | RM-Bench Chat Hard | Accuracy 83.3 | 34 |
| Reward Modeling | RM-Bench v1.0 (test) | Overall Score 74.3 | 29 |
| Reward Modeling Suitability Evaluation | RM Bench Math | EF -0.077 | 26 |
| Reward Modeling Suitability Evaluation | RM Bench Safety-accept | EF 0.698 | 26 |
| Reward Model Suitability Audit | RM Bench Chat | EF 0.313 | 26 |
| Reward Modeling | RM-Bench Chat | Accuracy 78.5 | 18 |
| Reward Modeling | RM-Bench Chat subset Normal | Accuracy 86 | 16 |
| Reward Modeling | RM-Bench (full) | Chat Score 83 | 11 |
| Preference Prediction | RM-Bench | Accuracy 87.8 | 10 |
| Reward Modeling | RM-Bench Hard | Accuracy 0.697 | 10 |
| Reward Modeling | RM-Bench Normal | Accuracy 80 | 10 |
| Reward Modeling | RM-Bench Easy | Accuracy 92.2 | 10 |
| Reward Modeling | RM-Bench 1k | Positional Consistency 73.5 | 8 |