| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Uncertainty Estimation | JudgeBench (test) | AUROC 71.53 | 77 |
| Reward Modeling | JudgeBench | Accuracy 93.3 | 45 |
| Reward Modeling | JudgeBench (test) | Overall 82 | 40 |
| Uncertainty Calibration | JudgeBench | Kuiper 0.037 | 24 |
| Pair-wise comparison | JudgeBench | Accuracy 75.7 | 16 |
| Reward Modeling | JudgeBench Knowledge | Accuracy 74.4 | 16 |
| Reward Modeling | JudgeBench | Knowledge 62.3 | 13 |
| Model Evaluation | JudgeBench (test) | Kuiper 5.63 | 8 |
| LLM-as-a-Judge | JudgeBench | Accuracy 84.19 | 8 |