
# Aggregate Benchmarks

| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Aggregate Tabular Benchmarking | Aggregate | Avg Rank | 1 | 33 |
| Aggregate Multimodal Evaluation | Aggregate | Relative Performance | 100 | 25 |
| Deepfake Detection | Aggregate All Unseen (test) | mAP | 98.06 | 24 |
| Summarization | Aggregate (test) | Comprehensiveness | 4.97 | 24 |
| Language Understanding | Aggregate ARC-C, MMLU, HellaSwag, TruthfulQA (test) | Total Score | 159.215 | 22 |
| Multi-task Evaluation | Aggregate (AIME25, AIME24, MATH, GSM8K, HumanEval+, MBPP+, MedQA, GPQA-Diamond) | Average Accuracy | 77.6 | 21 |
| LLM Inference | Aggregate Mean over Alpaca, CodeAlpaca, HumanEval, LiveCodeBench, Math500, MBPP, MT-Bench | Mean Speedup | 2.87 | 21 |
| Text Clustering | Aggregate | Macro Score | 80 | 21 |
| Multi-task Evaluation | Aggregate All tasks (summary) | Score | 74.6 | 20 |
| Watermark Detection and Quality Evaluation | Aggregate (MMLU, HellaSwag, ARC-C, GPQA, MBPP, GSM8K) | TPR@1 | 0.99 | 18 |
| Mathematical Reasoning | Aggregate MATH, Minerva, Olympiad, GAO, AMC23, AIME24, AIME25 | Average Score | 53.36 | 18 |
| Zero-shot NLP Evaluation | Aggregate | Average Accuracy | 60.47 | 18 |
| Scientific Reasoning | Aggregate GPQA, HLE, MMLU-Pro | Average Score | 44.6 | 17 |
| Overall Language Performance | Aggregate All Benchmarks | Average Accuracy | 62.03 | 16 |
| Reasoning | Aggregate (AIME’24, AIME’25, MATH 500, AMC’23, GPQA Diamond) (test) | AAI | 14.64 | 15 |
| Mathematical Reasoning | Aggregate AIME24, AMC, MATH500, Minerva, Olympiad bench | Accuracy Delta (%) | 22.7 | 15 |
| General Robustness and Detection Evaluation | Aggregate (CIFAR-100, CIFAR-100C, SVHN, Places365, Attacks) | Mean Score | 96 | 14 |
| Accuracy Prediction | Aggregate (multiple datasets and shifts) (test) | MAE | 0.0639 | 14 |
| General Performance | Aggregate Across Math, Code, Chat | Speedup | 4.91 | 12 |
| Skin Lesion Classification | Aggregate (Derm7pt, PAD, PH2) (out-of-distribution) | Avg ROC-AUC | 84.11 | 12 |
| General Reasoning Summary | Aggregate (GSM8K, MATH500, Minerva Math, Olympiad Bench, AIME24, AIME25, GPQA) | Accuracy | 79.3 | 11 |
| General Reasoning Generalization | Aggregate MATH, OlympiadBench, GSM8K, BBH, MMLU-CF, LongBench, HotpotQA, MuSiQue | Average Score | 63.88 | 10 |
| Multi-task Evaluation | Aggregate (LAMBADA, HellaSwag, PIQA, ARC, WinoGrande) (various) | Avg Accuracy | 51.9 | 10 |
| Pointmap Estimation | Aggregate (NuScenes, AV2, Waymo, ONCE, NuPlan) | Average Rank | 1.8 | 9 |
| Generalization Reasoning | Aggregate EMMA, VisuLogic, Zebra-CoT | Total Score | 37.5 | 9 |
*(Showing 25 of 40 rows.)*