| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Image Classification | Average 11 datasets | Base Accuracy | 87.55 | 95 |
| Commonsense Reasoning | Average 7 Commonsense Reasoning Tasks | Avg Accuracy | 72.04 | 72 |
| Conformal Inference | Average across 15 datasets (test) | Top-1 Accuracy | 81.1 | 60 |
| Few-shot Semantic Segmentation | Average Deepglobe, ISIC, Chest X-Ray, FSS-1000 | mIoU | 76.9 | 54 |
| Out-of-Distribution Detection | Average OpenImage-O, Texture, iNaturalist, ImageNet-O | AUROC | 96.23 | 54 |
| Multi-hop Question Answering | Average (MuSiQue, HotpotQA, 2WikiMultiHopQA, LongSeal) | Average QA Score | 46.49 | 50 |
| Hallucination Detection | Average Cross-domain | Mean AUROC | 0.7874 | 48 |
| Few-shot Image Classification | Average 11 datasets (test) | Average Accuracy (Few-shot) | 87.41 | 47 |
| Semantic Segmentation | Average Overall | mIoU | 64.3 | 46 |
| Question Answering | Average of 5 datasets | Average Score | 78.9 | 46 |
| Audio Modeling | Average Bach Counting Blues v1 (test) | SNR | 32.66 | 46 |
| Error Detection | Average All shifts (test) | AUC | 90.99 | 40 |
| Math and Science Reasoning | Average | Accuracy | 89.7 | 36 |
| SOC prediction | Average (FTP-75 and PDMHC) | MAE | 0.4118 | 35 |
| Perceptual Image Restoration | Average across datasets (combined) | PSNR | 37.6 | 35 |
| Continual Category Discovery | Average fine-grained | cACC (All) | 81.49 | 32 |
| Model Selection | Average | Weighted Kendall's Tau (w) | 0.47 | 32 |
| Cardiac Segmentation | Average (ACDC, M&Ms, RandBias, RandGhosting, RandMotion, RandSpike) (out-of-domain) | Dice Score | 81.25 | 32 |
| Multiple Choice Question Answering | Average (OBQA, ARC, Riddle, PQA) | Average Accuracy | 68.31 | 31 |
| Mathematical Reasoning | Average GSM8k-Aug, GSM-Hard, SVAMP, MultiArith | Avg Length | 0 | 31 |
| OOD Detection | Average | FPR@95 | 26.47 | 31 |
| General Evaluation | Average Across all benchmarks | Speedup | 6.9 | 28 |
| Mathematical Reasoning | Average GSM8k, GSM-sym, MATH, NuminaMath, AIME | Accuracy | 71.41 | 28 |
| Reasoning | Average (MATH, OlympicBench, MinervaMath, AMC2023, GPQA) (test) | Pass@1 | 59.83 | 27 |
| Multi-hop Question Answering | Average MuSiQue, 2wiki, HotpotQA | EM | 43.3 | 27 |