| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Time series forecasting | 3 datasets averaged (test) | MAE 0.08 | 22 |
| Selective Prediction | 3 datasets (mean over all 21 runs) | AUROC 0.7895 | 16 |
| LLM-as-a-Judge Routing | 3 datasets Average (test) | Accuracy 90 | 12 |
| Selective Prediction | 3 datasets (Trace) | AUROC 0.7984 | 8 |