Aggregate

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| General Capability | Aggregate (GPQA-D, GSM8K, HumanEval, MATH-500, MBPP, MMLU-Pro) | Average Accuracy 75.9 | 66 |
| Video Understanding | Aggregate MVBench, LongVideo Bench, MLVU, VideoMME | Average Accuracy 100 | 63 |
| Uncertainty Quantification | Aggregate All 11 Datasets | Mean PRR 55.7 | 44 |
| Average Zero-shot Performance | Aggregate of 8 tasks zero-shot | Accuracy (Zero-Shot Aggregate) 76.31 | 35 |
| Aggregate Tabular Benchmarking | Aggregate | Avg Rank 1 | 33 |
| Mathematical Reasoning | Aggregate AMC23, AIME24, MATH-500, Minerva, Olympiad | Intelligence Per Token (IPT) 68.2 | 30 |
| Aggregate Performance Evaluation | Aggregate 10-Benchmark Suite | Average Score 79.9 | 29 |
| Aggregate LLM Evaluation | Aggregate (MATH, GSM8K, H_Eval, MMLU, CEVAL, CMMLU, BoolQ, CSQA) | Aggregate Accuracy 81.41 | 26 |
| Aggregate Multimodal Evaluation | Aggregate | Relative Performance 100 | 25 |
| Ranking | Aggregate BUS-BRA, GIST514-DB, BreastMNIST, Breast | Average Rank 1.5 | 24 |
| Deepfake Detection | Aggregate All Unseen (test) | mAP 98.06 | 24 |
| Summarization | Aggregate (test) | Comprehensiveness 4.97 | 24 |
| Selective Generation | Aggregate All Datasets | Mean PRR 56.3 | 22 |
| Multimodal Question Answering | Aggregate (Open-WikiTable, 2WikiMQA, InfoSeek, Dyn-VQA, TabFact, WebQA) | Average Score 55.93 | 22 |
| Language Understanding | Aggregate ARC-C, MMLU, HellaSwag, TruthfulQA (test) | Total Score 159.215 | 22 |
| Multi-task Evaluation | Aggregate (AIME25, AIME24, MATH, GSM8K, HumanEval+, MBPP+, MedQA, GPQA-Diamond) | Average Accuracy 77.6 | 21 |
| LLM Inference | Aggregate Mean over Alpaca, CodeAlpaca, HumanEval, LiveCodeBench, Math500, MBPP, MT-Bench | Mean Speedup 2.87 | 21 |
| Text Clustering | Aggregate | Macro Score 80 | 21 |
| Multi-task Evaluation | Aggregate (GSM8K, BFCL, Spider, HumanEval) | Average Accuracy 79.4 | 20 |
| Tool Usage | Aggregate BFCL and Meta-Tool | Accuracy 86.6 | 20 |
| Multi-task Evaluation | Aggregate All tasks (summary) | Score 74.6 | 20 |
| Video Understanding | Aggregate (MVBench, EgoSchema, NExT-QA, VideoMME) | Aggregate Score 56.4 | 19 |
| Watermark Detection and Quality Evaluation | Aggregate (MMLU, HellaSwag, ARC-C, GPQA, MBPP, GSM8K) | TPR@10.99 | 18 |
| Mathematical Reasoning | Aggregate MATH, Minerva, Olympiad, GAO, AMC23, AIME24, AIME25 | Average Score 53.36 | 18 |
| Zero-shot NLP Evaluation | Aggregate | Average Accuracy 60.47 | 18 |
Showing 25 of 109 rows