Aggregate

Benchmarks

Task Name	Dataset Name	SOTA Result
General Capability	Aggregate (GPQA-D, GSM8K, HumanEval, MATH-500, MBPP, MMLU-Pro)	Average Accuracy 75.9	66
Video Understanding	Aggregate MVBench, LongVideo Bench, MLVU, VideoMME	Average Accuracy 100	63
Uncertainty Quantification	Aggregate All 11 Datasets	Mean PRR 55.7	44
Average Zero-shot Performance	Aggregate of 8 tasks zero-shot	Accuracy (Zero-Shot Aggregate) 76.31	35
Aggregate Tabular Benchmarking	Aggregate	Avg Rank 1	33
Mathematical Reasoning	Aggregate AMC23, AIME24, MATH-500, Minerva, Olympiad	Intelligence Per Token (IPT) 68.2	30
Aggregate Performance Evaluation	Aggregate 10-Benchmark Suite	Average Score 79.9	29
Aggregate LLM Evaluation	Aggregate (MATH, GSM8K, H_Eval, MMLU, CEVAL, CMMLU, BoolQ, CSQA)	Aggregate Accuracy 81.41	26
Aggregate Multimodal Evaluation	Aggregate	Relative Performance 100	25
Ranking	Aggregate BUS-BRA, GIST514-DB, BreastMNIST, Breast	Average Rank 1.5	24
Deepfake Detection	Aggregate All Unseen (test)	mAP 98.06	24
Summarization	Aggregate (test)	Comprehensiveness 4.97	24
Selective Generation	Aggregate All Datasets	Mean PRR 56.3	22
Multimodal Question Answering	Aggregate (Open-WikiTable, 2WikiMQA, InfoSeek, Dyn-VQA, TabFact, WebQA)	Average Score 55.93	22
Language Understanding	Aggregate ARC-C, MMLU, HellaSwag, TruthfulQA (test)	Total Score 159.215	22
Multi-task Evaluation	Aggregate (AIME25, AIME24, MATH, GSM8K, HumanEval+, MBPP+, MedQA, GPQA-Diamond)	Average Accuracy 77.6	21
LLM Inference	Aggregate Mean over Alpaca, CodeAlpaca, HumanEval, LiveCodeBench, Math500, MBPP, MT-Bench	Mean Speedup 2.87	21
Text Clustering	Aggregate	Macro Score 80	21
Multi-task Evaluation	Aggregate (GSM8K, BFCL, Spider, HumanEval)	Average Accuracy 79.4	20
Tool Usage	Aggregate BFCL and Meta-Tool	Accuracy 86.6	20
Multi-task Evaluation	Aggregate All tasks (summary)	Score 74.6	20
Video Understanding	Aggregate (MVBench, EgoSchema, NExT-QA, VideoMME)	Aggregate Score 56.4	19
Watermark Detection and Quality Evaluation	Aggregate (MMLU, HellaSwag, ARC-C, GPQA, MBPP, GSM8K)	TPR@10.99	18
Mathematical Reasoning	Aggregate MATH, Minerva, Olympiad, GAO, AMC23, AIME24, AIME25	Average Score 53.36	18
Zero-shot NLP Evaluation	Aggregate	Average Accuracy 60.47	18

Showing 25 of 109 rows