| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Aggregate Tabular Benchmarking | Aggregate | Avg Rank | 1 | 33 |
| Aggregate Multimodal Evaluation | Aggregate | Relative Performance | 100 | 25 |
| Deepfake Detection | Aggregate All Unseen (test) | mAP | 98.06 | 24 |
| Summarization | Aggregate (test) | Comprehensiveness | 4.97 | 24 |
| Language Understanding | Aggregate ARC-C, MMLU, HellaSwag, TruthfulQA (test) | Total Score | 159.215 | 22 |
| Multi-task Evaluation | Aggregate (AIME25, AIME24, MATH, GSM8K, HumanEval+, MBPP+, MedQA, GPQA-Diamond) | Average Accuracy | 77.6 | 21 |
| LLM Inference | Aggregate Mean over Alpaca, CodeAlpaca, HumanEval, LiveCodeBench, Math500, MBPP, MT-Bench | Mean Speedup | 2.87 | 21 |
| Text Clustering | Aggregate | Macro Score | 80 | 21 |
| Multi-task Evaluation | Aggregate All tasks (summary) | Score | 74.6 | 20 |
| Watermark Detection and Quality Evaluation | Aggregate (MMLU, HellaSwag, ARC-C, GPQA, MBPP, GSM8K) | TPR@1 | 0.99 | 18 |
| Mathematical Reasoning | Aggregate MATH, Minerva, Olympiad, GAO, AMC23, AIME24, AIME25 | Average Score | 53.36 | 18 |
| Zero-shot NLP Evaluation | Aggregate | Average Accuracy | 60.47 | 18 |
| Scientific Reasoning | Aggregate GPQA, HLE, MMLU-Pro | Average Score | 44.6 | 17 |
| Overall Language Performance | Aggregate All Benchmarks | Average Accuracy | 62.03 | 16 |
| Reasoning | Aggregate (AIME’24, AIME’25, MATH 500, AMC’23, GPQA Diamond) (test) | AAI | 14.64 | 15 |
| Mathematical Reasoning | Aggregate AIME24, AMC, MATH500, Minerva, OlympiadBench | Accuracy Delta (%) | 22.7 | 15 |
| General Robustness and Detection Evaluation | Aggregate (CIFAR-100, CIFAR-100C, SVHN, Places365, Attacks) | Mean Score | 96 | 14 |
| Accuracy Prediction | Aggregate (multiple datasets and shifts) (test) | MAE | 0.0639 | 14 |
| General Performance | Aggregate Across Math, Code, Chat | Speedup | 4.91 | 12 |
| Skin Lesion Classification | Aggregate (Derm7pt, PAD, PH2) (out-of-distribution) | Avg ROC-AUC | 84.11 | 12 |
| General Reasoning Summary | Aggregate (GSM8K, MATH500, Minerva Math, OlympiadBench, AIME24, AIME25, GPQA) | Accuracy | 79.3 | 11 |
| General Reasoning Generalization | Aggregate MATH, OlympiadBench, GSM8K, BBH, MMLU-CF, LongBench, HotpotQA, MuSiQue | Average Score | 63.88 | 10 |
| Multi-task Evaluation | Aggregate (LAMBADA, HellaSwag, PIQA, ARC, WinoGrande) (various) | Avg Accuracy | 51.9 | 10 |
| Pointmap Estimation | Aggregate (NuScenes, AV2, Waymo, ONCE, NuPlan) | Average Rank | 1.8 | 9 |
| Generalization Reasoning | Aggregate EMMA, VisuLogic, Zebra-CoT | Total Score | 37.5 | 9 |