| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Video Understanding | Aggregate (MVBench, LongVideoBench, MLVU, VideoMME) | Average Score: 100 | 59 |
| Average Zero-shot Performance | Aggregate of 8 tasks, zero-shot | Accuracy (Zero-Shot Aggregate): 76.31 | 35 |
| Aggregate Tabular Benchmarking | Aggregate | Avg Rank: 1 | 33 |
| Aggregate Performance Evaluation | Aggregate (10-Benchmark Suite) | Average Score: 79.9 | 29 |
| Aggregate Multimodal Evaluation | Aggregate | Relative Performance: 100 | 25 |
| Ranking | Aggregate (BUS-BRA, GIST514-DB, BreastMNIST, Breast) | Average Rank: 1.5 | 24 |
| Deepfake Detection | Aggregate, all unseen (test) | mAP: 98.06 | 24 |
| Summarization | Aggregate (test) | Comprehensiveness: 4.97 | 24 |
| Multimodal Question Answering | Aggregate (Open-WikiTable, 2WikiMQA, InfoSeek, Dyn-VQA, TabFact, WebQA) | Average Score: 55.93 | 22 |
| Language Understanding | Aggregate (ARC-C, MMLU, HellaSwag, TruthfulQA) (test) | Total Score: 159.215 | 22 |
| Multi-task Evaluation | Aggregate (AIME25, AIME24, MATH, GSM8K, HumanEval+, MBPP+, MedQA, GPQA-Diamond) | Average Accuracy: 77.6 | 21 |
| LLM Inference | Aggregate (mean over Alpaca, CodeAlpaca, HumanEval, LiveCodeBench, Math500, MBPP, MT-Bench) | Mean Speedup: 2.87 | 21 |
| Text Clustering | Aggregate | Macro Score: 80 | 21 |
| Multi-task Evaluation | Aggregate, all tasks (summary) | Score: 74.6 | 20 |
| Video Understanding | Aggregate (MVBench, EgoSchema, NExT-QA, VideoMME) | Aggregate Score: 56.4 | 19 |
| Watermark Detection and Quality Evaluation | Aggregate (MMLU, HellaSwag, ARC-C, GPQA, MBPP, GSM8K) | TPR@1: 0.99 | 18 |
| Mathematical Reasoning | Aggregate (MATH, Minerva, Olympiad, GAO, AMC23, AIME24, AIME25) | Average Score: 53.36 | 18 |
| Zero-shot NLP Evaluation | Aggregate | Average Accuracy: 60.47 | 18 |
| Agent Interaction | Aggregate (FTWP, ScienceWorld, WebShop) | Format Faithfulness Rate: 89.74 | 17 |
| Safety Evaluation | Aggregate (JBB, SR, WJ) | Reasoning Average: 13.9 | 17 |
| Scientific Reasoning | Aggregate (GPQA, HLE, MMLU-Pro) | Average Score: 44.6 | 17 |
| General Language Modeling Performance | Aggregate (AlpacaEval, TruthfulQA, GSM8K, DROP, AGIEval, BBH, MMLU) | Average Score: 44.7 | 16 |
| Overall Language Performance | Aggregate (all benchmarks) | Average Accuracy: 62.03 | 16 |
| Reasoning | Aggregate (AIME’24, AIME’25, MATH 500, AMC’23, GPQA Diamond) (test) | AAI: 14.64 | 15 |
| Mathematical Reasoning | Aggregate (AIME24, AMC, MATH500, Minerva, OlympiadBench) | Accuracy Delta (%): 22.7 | 15 |
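Most of the "SOTA Result" cells above are unweighted means over the member benchmarks (or over per-benchmark ranks, for the "Avg Rank" / "Average Rank" rows). Below is a minimal sketch of both aggregations, assuming equal weighting, which is the usual convention unless a paper states otherwise; the per-benchmark numbers and helper names are illustrative placeholders, not values taken from any row in the table.

```python
# Sketch of the two aggregation styles used in the table above.
# All scores/ranks here are hypothetical, not from any listed result.

from statistics import mean


def average_score(scores: dict[str, float]) -> float:
    """Unweighted mean over per-benchmark scores ('Average Score' rows)."""
    return mean(scores.values())


def average_rank(ranks: dict[str, int]) -> float:
    """Unweighted mean over per-benchmark ranks ('Avg Rank' rows).

    Lower is better: a method ranked 1st on one dataset and 2nd on
    another averages to 1.5.
    """
    return mean(ranks.values())


if __name__ == "__main__":
    # Hypothetical video-understanding scores on four benchmarks.
    scores = {"MVBench": 61.2, "LongVideoBench": 57.8, "MLVU": 64.5, "VideoMME": 59.1}
    print(f"Average Score: {average_score(scores):.2f}")

    # Hypothetical ranks on two medical-imaging datasets.
    ranks = {"BUS-BRA": 1, "BreastMNIST": 2}
    print(f"Average Rank: {average_rank(ranks):.1f}")
```

Rank averaging sidesteps scale differences between heterogeneous metrics, at the cost of discarding the margin by which a method wins each benchmark.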