| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Reward Modeling | JudgeBench | Accuracy | 93.3 | 105 |
| Uncertainty Estimation | JudgeBench (test) | AUROC | 71.53 | 77 |
| Reward Modeling | JudgeBench (test) | Overall | 82 | 40 |
| LLM-as-a-Judge | JudgeBench | Accuracy | 84.19 | 29 |
| Uncertainty Calibration | JudgeBench | Kuiper | 0.037 | 24 |
| LLM-as-a-Judge Evaluation | JudgeBench (test) | Score | 83.4 | 22 |
| LLM Evaluation | JudgeBench (test) | Knowledge | 79.9 | 16 |
| Pair-wise comparison | JudgeBench | Accuracy | 75.7 | 16 |
| Reward Modeling | JudgeBench Knowledge | Accuracy | 74.4 | 16 |
| Reward Modeling | JudgeBench | Knowledge | 62.3 | 13 |
| Preference Prediction | JudgeBench | Positional Consistent Accuracy | 63.9 | 10 |
| Reward Modeling | JudgeBench | Positional Consistency Score | 56.3 | 8 |
| LLM-as-a-Judge | JudgeBench (Merged GPT Claude) | Direct Baseline Score | 87.38 | 8 |
| Model Evaluation | JudgeBench (test) | Kuiper | 5.63 | 8 |