| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| LLM Evaluation | Shared (evaluation) | Tie-aware Accuracy | 78 | 10 |
| Multi-judge evaluation | Shared 500-prompt sample | Global Correlation (r) | 0.87 | 5 |
| Calibration and Discrimination | Shared pooled aggregation (test) | Brier Score (BS) | 0.1 | 4 |