| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Image Classification | Average 11 datasets | Base Accuracy | 87.55 | 83 |
| Commonsense Reasoning | Average 7 Commonsense Reasoning Tasks | Avg Accuracy | 72.04 | 72 |
| Conformal Inference | Average across 15 datasets (test) | Top-1 Accuracy | 81.1 | 60 |
| Few-shot Semantic Segmentation | Average Deepglobe, ISIC, Chest X-Ray, FSS-1000 | mIoU | 76.9 | 54 |
| Out-of-Distribution Detection | Average OpenImage-O, Texture, iNaturalist, ImageNet-O | AUROC | 96.23 | 54 |
| Hallucination Detection | Average Cross-domain | Mean AUROC | 0.7874 | 48 |
| Few-shot Image Classification | Average 11 datasets (test) | Average Accuracy (Few-shot) | 87.41 | 47 |
| Question Answering | Average of 5 datasets | Average Score | 78.9 | 46 |
| Audio Modeling | Average Bach Counting Blues v1 (test) | SNR | 32.66 | 46 |
| Error Detection | Average All shifts (test) | AUC | 90.99 | 40 |
| Perceptual Image Restoration | Average across datasets (combined) | PSNR | 37.6 | 35 |
| Model Selection | Average | Weighted Kendall's Tau (w) | 0.47 | 32 |
| Cardiac Segmentation | Average (ACDC, M&Ms, RandBias, RandGhosting, RandMotion, RandSpike) (out-of-domain) | Dice Score | 81.25 | 32 |
| Multiple Choice Question Answering | Average (OBQA, ARC, Riddle, PQA) | Average Accuracy | 68.31 | 31 |
| Semantic Segmentation | Average Overall | mIoU | 51.9 | 28 |
| Reasoning | Average (MATH, OlympicBench, MinervaMath, AMC2023, GPQA) (test) | Pass@1 | 59.83 | 27 |
| Multi-hop Question Answering | Average MuSiQue, 2wiki, HotpotQA | F1 Score | 63.02 | 26 |
| Mathematical Reasoning | Average GSM8k-Aug, GSM-Hard, SVAMP, MultiArith | Accuracy | 79.3 | 26 |
| Image Deraining | Average across Deraining Datasets (test) | PSNR | 34.24 | 26 |
| Video Understanding | Average VideoMME, LongVideoBench, MLVU | Score | 55.9 | 25 |
| Question Answering | Average (TriviaQA, HotpotQA, 2Wiki, Musique, Bamboogle) | EM Accuracy | 49.12 | 24 |
| Mathematical Reasoning | Average Six Benchmarks | Accuracy | 68.66 | 24 |
| Classification | Average | Base Score | 87.41 | 24 |
| Calibration | Average StrategyQA, HotpotQA, NQ, Bamboogle | ECE | 0.264 | 24 |
| Multi-hop Question Answering | Average (2WikiMQA, Bamboogle, Frames, MuSiQue) | Accuracy | 54.8 | 24 |