Evaluation Suite

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Question Answering | Evaluation Suite (ARC, HellaSwag, MMLU) Zero-shot (test) | ARC-C: 52.9 | 67 |
| Classification | Zero-shot Evaluation Suite (BoolQ, PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, OBQA) | Average Accuracy (Zero-Shot Suite): 69.31 | 59 |
| Image Classification | Zero-shot Evaluation Suite (Food-101, CIFAR-10, CIFAR-100, SUN397, Stanford Cars, FGVC Aircraft, DTD, Oxford-IIIT Pets, Caltech-101, Flowers102, ImageNet-1K) various (test) | Food-101 Top-1 Acc: 90.5 | 29 |
| Zero-shot Language Understanding | Evaluation Suite Zero-shot (LMB, HellA, PIQA, ARC-E, ARC-C, WINO, Open, MMLU) | ARC-E Accuracy: 83.4 | 25 |
| Zero-shot Reasoning | Evaluation Suite Zero-shot (OpenbookQA, ARC-e, ARC-c, WinoGrande, HellaSwag, PIQA, MathQA) | Average Accuracy: 54.92 | 24 |
| Zero-shot Question Answering and Reasoning | Evaluation Suite Zero-shot (ARC, LogiQA, Wino, CSQA, BoolQ, PIQA, MMLU) | ARC: 83.88 | 21 |
| General Language Understanding | 12-task evaluation suite (test) | Average Score: 71.62 | 20 |
| Question Answering and Commonsense Reasoning | Zero-Shot Evaluation Suite (Arc-e, Arc-c, Boolq, Hellaswag, Openbookqa, Piqa, SciQ, Winogrande) | ARC-E: 65.74 | 18 |
| Large Language Model Evaluation | 12-task evaluation suite composite (test) | Reading Comprehension Score: 49.6 | 14 |
| Zero-shot Evaluation | Zero-shot Evaluation Suite | ARCC: 60.5 | 14 |
| Natural Language Understanding | Evaluation Suite Zero-shot (Arc-c, Arc-e, BoolQ, COPA, MMLU, OBQA, PIQA, RTE, Winogrande) | ARC-c: 56.14 | 12 |
| Zero-shot Classification | Zero-shot Evaluation Suite (AC, AE, WI, QA) v1 | AC Score: 46.2 | 10 |
| Multimodal Understanding | Evaluation Suite Combined (held-out) | Accuracy: 48.4 | 4 |