
Evaluation Suite

Benchmarks

| Task | Dataset / Suite | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Classification | Zero-shot Evaluation Suite (BoolQ, PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, OBQA) | Average Accuracy (Zero-Shot Suite) | 69.99 | 94 |
| Question Answering | Evaluation Suite (ARC, HellaSwag, MMLU), Zero-shot (test) | ARC-C | 52.9 | 67 |
| Zero-shot Reasoning | Evaluation Suite Zero-shot (OpenBookQA, ARC-e, ARC-c, WinoGrande, HellaSwag, PIQA, MathQA) | Average Accuracy | 69.99 | 56 |
| Image Classification | Zero-shot Evaluation Suite (Food-101, CIFAR-10, CIFAR-100, SUN397, Stanford Cars, FGVC Aircraft, DTD, Oxford-IIIT Pets, Caltech-101, Flowers102, ImageNet-1K), various (test) | Food-101 Top-1 Acc | 90.5 | 29 |
| Zero-shot Language Understanding | Evaluation Suite Zero-shot (LMB, HellA, PIQA, ARC-E, ARC-C, WINO, Open, MMLU) | ARC-E Accuracy | 83.4 | 25 |
| Language Understanding and Common Sense Reasoning | Zero-shot Evaluation Suite (PIQA, ARC-C, ARC-E, HellaS, WinoG, BoolQ, LAMBADA, C4) | PIQA Accuracy | 79.67 | 24 |
| Zero-shot Question Answering and Reasoning | Evaluation Suite Zero-shot (ARC, LogiQA, Wino, CSQA, BoolQ, PIQA, MMLU) | ARC | 83.88 | 21 |
| General Language Understanding | 12-task evaluation suite (test) | Average Score | 71.62 | 20 |
| Aggregate Multimodal Evaluation | Evaluation Suite Average | Average Score | 100 | 19 |
| Question Answering and Commonsense Reasoning | Zero-Shot Evaluation Suite (ARC-e, ARC-c, BoolQ, HellaSwag, OpenBookQA, PIQA, SciQ, WinoGrande) | ARC-E | 65.74 | 18 |
| Commonsense Reasoning | Zero-shot Evaluation Suite (HellaSwag, PIQA, ARC-E, ARC-C, WinoGrande, OBQA, SIQA, BoolQ) | HellaSwag (Zero-shot) | 38.29 | 15 |
| Large Language Model Evaluation | 12-task evaluation suite composite (test) | Reading Comprehension Score | 49.6 | 14 |
| Zero-shot Evaluation | Zero-shot Evaluation Suite | ARC-C | 60.5 | 14 |
| Zero-shot Classification | Evaluation Suite Zero-shot (PIQA, LAMBADA, ARC-e, ARC-c, HellaS) | Decode Latency | 5.23 | 12 |
| Natural Language Understanding | Evaluation Suite Zero-shot (ARC-c, ARC-e, BoolQ, COPA, MMLU, OBQA, PIQA, RTE, WinoGrande) | ARC-c | 56.14 | 12 |
| Zero-shot Classification | Zero-shot Evaluation Suite (AC, AE, WI, QA) v1 | AC Score | 46.2 | 10 |
| Question Answering | Evaluation Suite Zero-shot | Accuracy (Zero-shot) | 54.3 | 9 |
| Zero-shot Evaluation | Zero-shot Evaluation Suite (ARC, HellaSwag, MMLU, PIQA, WinoGrande) | ARC Challenge Accuracy | 47.7 | 6 |
| Zero-shot Reasoning and Knowledge | Zero-shot Evaluation Suite (ARC, HellaSwag, MMLU, PIQA, WinoGrande) | ARC-C Acc | 48.1 | 6 |
| General Model Capability | Table 3 Evaluation Suite | Average Score | 70.94 | 4 |
| Multimodal Understanding | Evaluation Suite Combined (held-out) | Accuracy | 48.4 | 4 |