| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Reward Modeling | PPE Correctness | Accuracy 67.3 | 33 |
| Preference Calibration | PPE | Kuiper 0.034 | 24 |
| Correctness Calibration | PPE (Preference Policy Evaluation) | Kuiper 0.017 | 24 |
| Reward Modeling | PPE-P | Accuracy 68.3 | 23 |
| Reward Modeling | PPE-Preference | Accuracy 69.8 | 20 |
| Reward Modeling | PPE Human | Accuracy 64.6 | 10 |