| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Conformal Inference | Average across 15 datasets (test) | Top-1 Accuracy 81.1 | 60 |
| Out-of-Distribution Detection | Average OpenImage-O, Texture, iNaturalist, ImageNet-O | AUROC 96.23 | 54 |
| Image Classification | Average 11 datasets | Base Accuracy 87.55 | 52 |
| Hallucination Detection | Average Cross-domain | Mean AUROC 0.7874 | 48 |
| Commonsense Reasoning | Average 7 Commonsense Reasoning Tasks | Avg Accuracy 72.04 | 47 |
| Audio Modeling | Average Bach Counting Blues v1 (test) | SNR 32.66 | 46 |
| Error Detection | Average All shifts (test) | AUC 90.99 | 40 |
| Few-shot Semantic Segmentation | Average Deepglobe, ISIC, Chest X-Ray, FSS-1000 | mIoU 76.9 | 32 |
| Cardiac Segmentation | Average (ACDC, M&Ms, RandBias, RandGhosting, RandMotion, RandSpike) (out-of-domain) | Dice Score 81.25 | 32 |
| Multiple Choice Question Answering | Average (OBQA, ARC, Riddle, PQA) | Average Accuracy 68.31 | 31 |
| Perceptual Image Restoration | Average across datasets (combined) | PSNR 31.28 | 27 |
| Mathematical Reasoning | Average GSM8k-Aug, GSM-Hard, SVAMP, MultiArith | Accuracy 79.3 | 26 |
| Image Deraining | Average across Deraining Datasets (test) | PSNR 34.24 | 26 |
| Video Understanding | Average VideoMME, LongVideoBench, MLVU | Score 55.9 | 25 |
| Mathematical Reasoning | Average Six Benchmarks | Accuracy 68.66 | 24 |
| Calibration | Average StrategyQA, HotpotQA, NQ, Bamboogle | ECE 0.264 | 24 |
| Multi-hop Question Answering | Average (2WikiMQA, Bamboogle, Frames, MuSiQue) | Accuracy 54.8 | 24 |
| Instruction Following Evaluation | Average (Vicuna, Self-instruct, Dolly, BPO) (test) | Delta Win Rate (ΔWR) 22 | 24 |
| Aggregate NLP Tasks | Average (Emotion, Irony, Stance, MRPC, RTE) | Delta 12.93 | 24 |
| Continual routing | Average | Accuracy 75.2 | 22 |
| Image Deraining | Average Test100, Test1200, Rain100H, Rain100L (test) | PSNR 34.59 | 21 |
| Video Reconstruction | Average | PSNR 27.73 | 21 |
| Aggregated Performance | Average 10 Tasks | Average Accuracy 100.6 | 19 |
| Question Answering | Average 10K context | Accuracy 74.6 | 19 |
| OOD Detection | Average | FPR@95 26.47 | 19 |