| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| General Capability | Aggregate (GPQA-D, GSM8K, HumanEval, MATH-500, MBPP, MMLU-Pro) | Average Accuracy 75.9 | 66 |
| Video Understanding | Aggregate MVBench, LongVideo Bench, MLVU, VideoMME | Average Accuracy 100 | 63 |
| Uncertainty Quantification | Aggregate All 11 Datasets | Mean PRR 55.7 | 44 |
| Average Zero-shot Performance | Aggregate of 8 tasks zero-shot | Accuracy (Zero-Shot Aggregate) 76.31 | 35 |
| Aggregate Tabular Benchmarking | Aggregate | Avg Rank 1 | 33 |
| Mathematical Reasoning | Aggregate AMC23, AIME24, MATH-500, Minerva, Olympiad | Intelligence Per Token (IPT) 68.2 | 30 |
| Aggregate Performance Evaluation | Aggregate 10-Benchmark Suite | Average Score 79.9 | 29 |
| Aggregate LLM Evaluation | Aggregate (MATH, GSM8K, H_Eval, MMLU, CEVAL, CMMLU, BoolQ, CSQA) | Aggregate Accuracy 81.41 | 26 |
| Aggregate Multimodal Evaluation | Aggregate | Relative Performance 100 | 25 |
| Ranking | Aggregate BUS-BRA, GIST514-DB, BreastMNIST, Breast | Average Rank 1.5 | 24 |
| Deepfake Detection | Aggregate All Unseen (test) | mAP 98.06 | 24 |
| Summarization | Aggregate (test) | Comprehensiveness 4.97 | 24 |
| Selective Generation | Aggregate All Datasets | Mean PRR 56.3 | 22 |
| Multimodal Question Answering | Aggregate (Open-WikiTable, 2WikiMQA, InfoSeek, Dyn-VQA, TabFact, WebQA) | Average Score 55.93 | 22 |
| Language Understanding | Aggregate ARC-C, MMLU, HellaSwag, TruthfulQA (test) | Total Score 159.215 | 22 |
| Multi-task Evaluation | Aggregate (AIME25, AIME24, MATH, GSM8K, HumanEval+, MBPP+, MedQA, GPQA-Diamond) | Average Accuracy 77.6 | 21 |
| LLM Inference | Aggregate Mean over Alpaca, CodeAlpaca, HumanEval, LiveCodeBench, Math500, MBPP, MT-Bench | Mean Speedup 2.87 | 21 |
| Text Clustering | Aggregate | Macro Score 80 | 21 |
| Multi-task Evaluation | Aggregate (GSM8K, BFCL, Spider, HumanEval) | Average Accuracy 79.4 | 20 |
| Tool Usage | Aggregate BFCL and Meta-Tool | Accuracy 86.6 | 20 |
| Multi-task Evaluation | Aggregate All tasks (summary) | Score 74.6 | 20 |
| Video Understanding | Aggregate (MVBench, EgoSchema, NExT-QA, VideoMME) | Aggregate Score 56.4 | 19 |
| Watermark Detection and Quality Evaluation | Aggregate (MMLU, HellaSwag, ARC-C, GPQA, MBPP, GSM8K) | TPR@10.99 | 18 |
| Mathematical Reasoning | Aggregate MATH, Minerva, Olympiad, GAO, AMC23, AIME24, AIME25 | Average Score 53.36 | 18 |
| Zero-shot NLP Evaluation | Aggregate | Average Accuracy 60.47 | 18 |