Aggregate Benchmarks

| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Video Understanding | Aggregate: MVBench, LongVideoBench, MLVU, VideoMME | Average Score | 100 | 59 |
| Average Zero-shot Performance | Aggregate of 8 tasks, zero-shot | Accuracy (Zero-Shot Aggregate) | 76.31 | 35 |
| Aggregate Tabular Benchmarking | Aggregate | Avg Rank | 1 | 33 |
| Aggregate Performance Evaluation | Aggregate: 10-Benchmark Suite | Average Score | 79.9 | 29 |
| Aggregate Multimodal Evaluation | Aggregate | Relative Performance | 100 | 25 |
| Ranking | Aggregate: BUS-BRA, GIST514-DB, BreastMNIST, Breast | Average Rank | 1.5 | 24 |
| Deepfake Detection | Aggregate: All Unseen (test) | mAP | 98.06 | 24 |
| Summarization | Aggregate (test) | Comprehensiveness | 4.97 | 24 |
| Multimodal Question Answering | Aggregate: Open-WikiTable, 2WikiMQA, InfoSeek, Dyn-VQA, TabFact, WebQA | Average Score | 55.93 | 22 |
| Language Understanding | Aggregate: ARC-C, MMLU, HellaSwag, TruthfulQA (test) | Total Score | 159.215 | 22 |
| Multi-task Evaluation | Aggregate: AIME25, AIME24, MATH, GSM8K, HumanEval+, MBPP+, MedQA, GPQA-Diamond | Average Accuracy | 77.6 | 21 |
| LLM Inference | Aggregate: mean over Alpaca, CodeAlpaca, HumanEval, LiveCodeBench, Math500, MBPP, MT-Bench | Mean Speedup | 2.87 | 21 |
| Text Clustering | Aggregate | Macro Score | 80 | 21 |
| Multi-task Evaluation | Aggregate: All tasks (summary) | Score | 74.6 | 20 |
| Video Understanding | Aggregate: MVBench, EgoSchema, NExT-QA, VideoMME | Aggregate Score | 56.4 | 19 |
| Watermark Detection and Quality Evaluation | Aggregate: MMLU, HellaSwag, ARC-C, GPQA, MBPP, GSM8K | TPR@1 | 0.99 | 18 |
| Mathematical Reasoning | Aggregate: MATH, Minerva, Olympiad, GAO, AMC23, AIME24, AIME25 | Average Score | 53.36 | 18 |
| Zero-shot NLP Evaluation | Aggregate | Average Accuracy | 60.47 | 18 |
| Agent Interaction | Aggregate: FTWP, ScienceWorld, WebShop | Format Faithfulness Rate | 89.74 | 17 |
| Safety Evaluation | Aggregate: JBB, SR, WJ | Reasoning Average | 13.9 | 17 |
| Scientific Reasoning | Aggregate: GPQA, HLE, MMLU-Pro | Average Score | 44.6 | 17 |
| General Language Modeling Performance | Aggregate: AlpacaEval, TruthfulQA, GSM8K, DROP, AGIEval, BBH, MMLU | Average Score | 44.7 | 16 |
| Overall Language Performance | Aggregate: All Benchmarks | Average Accuracy | 62.03 | 16 |
| Reasoning | Aggregate: AIME'24, AIME'25, MATH 500, AMC'23, GPQA Diamond (test) | AAI | 14.64 | 15 |
| Mathematical Reasoning | Aggregate: AIME24, AMC, MATH500, Minerva, OlympiadBench | Accuracy Delta (%) | 22.7 | 15 |
Showing 25 of 64 rows
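Most "Average Score" / "Average Accuracy" entries above are equal-weight macro-averages over the listed constituent benchmarks. A minimal sketch of that aggregation, assuming equal weighting (individual leaderboards may weight differently); the benchmark names and scores below are illustrative, not the actual leaderboard values:

```python
def macro_average(scores: dict[str, float]) -> float:
    """Equal-weight mean over per-benchmark scores (assumed aggregation rule)."""
    return sum(scores.values()) / len(scores)

# Hypothetical per-benchmark scores for a video-understanding suite.
video_scores = {"MVBench": 60.0, "EgoSchema": 50.0, "NExT-QA": 62.0, "VideoMME": 52.0}
print(round(macro_average(video_scores), 2))  # → 56.0
```

Rank-based entries ("Avg Rank", "Average Rank") aggregate the same way, but over a model's per-dataset rank rather than its score, so lower is better.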