| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Classification | Zero-shot Evaluation Suite (BoolQ, PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, OBQA) | Average Accuracy (Zero-Shot Suite): 69.99 | 94 |
| Question Answering | Evaluation Suite (ARC, HellaSwag, MMLU), Zero-shot (test) | ARC-C: 52.9 | 67 |
| Zero-shot Reasoning | Evaluation Suite, Zero-shot (OpenBookQA, ARC-e, ARC-c, WinoGrande, HellaSwag, PIQA, MathQA) | Average Accuracy: 69.99 | 56 |
| Image Classification | Zero-shot Evaluation Suite (Food-101, CIFAR-10, CIFAR-100, SUN397, Stanford Cars, FGVC Aircraft, DTD, Oxford-IIIT Pets, Caltech-101, Flowers102, ImageNet-1K), various (test) | Food-101 Top-1 Acc: 90.5 | 29 |
| Zero-shot Language Understanding | Evaluation Suite, Zero-shot (LMB, HellA, PIQA, ARC-E, ARC-C, WINO, Open, MMLU) | ARC-E Accuracy: 83.4 | 25 |
| Language Understanding and Common Sense Reasoning | Zero-shot Evaluation Suite (PIQA, ARC-C, ARC-E, HellaSwag, WinoGrande, BoolQ, LAMBADA, C4) | PIQA Accuracy: 79.67 | 24 |
| Zero-shot Question Answering and Reasoning | Evaluation Suite, Zero-shot (ARC, LogiQA, WinoGrande, CSQA, BoolQ, PIQA, MMLU) | ARC: 83.88 | 21 |
| General Language Understanding | 12-task evaluation suite (test) | Average Score: 71.62 | 20 |
| Aggregate Multimodal Evaluation | Evaluation Suite Average | Average Score: 100 | 19 |
| Question Answering and Commonsense Reasoning | Zero-Shot Evaluation Suite (ARC-e, ARC-c, BoolQ, HellaSwag, OpenBookQA, PIQA, SciQ, WinoGrande) | ARC-E: 65.74 | 18 |
| Commonsense Reasoning | Zero-shot Evaluation Suite (HellaSwag, PIQA, ARC-E, ARC-C, WinoGrande, OBQA, SIQA, BoolQ) | HellaSwag (Zero-shot): 38.29 | 15 |
| Large Language Model Evaluation | 12-task evaluation suite composite (test) | Reading Comprehension Score: 49.6 | 14 |
| Zero-shot Evaluation | Zero-shot Evaluation Suite | ARC-C: 60.5 | 14 |
| Zero-shot Classification | Evaluation Suite, Zero-shot (PIQA, LAMBADA, ARC-e, ARC-c, HellaSwag) | Decode Latency: 5.23 | 12 |
| Natural Language Understanding | Evaluation Suite, Zero-shot (ARC-c, ARC-e, BoolQ, COPA, MMLU, OBQA, PIQA, RTE, WinoGrande) | ARC-c: 56.14 | 12 |
| Zero-shot Classification | Zero-shot Evaluation Suite (AC, AE, WI, QA) v1 | AC Score: 46.2 | 10 |
| Question Answering | Evaluation Suite, Zero-shot | Accuracy (Zero-shot): 54.3 | 9 |
| Zero-shot Evaluation | Zero-shot Evaluation Suite (ARC, HellaSwag, MMLU, PIQA, WinoGrande) | ARC Challenge Accuracy: 47.7 | 6 |
| Zero-shot Reasoning and Knowledge | Zero-shot Evaluation Suite (ARC, HellaSwag, MMLU, PIQA, WinoGrande) | ARC-C Acc: 48.1 | 6 |
| General Model Capability | Table 3 Evaluation Suite | Average Score: 70.94 | 4 |
| Multimodal Understanding | Evaluation Suite Combined (held-out) | Accuracy: 48.4 | 4 |
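Several rows above report a suite-level "Average Accuracy" or "Average Score", which is conventionally the unweighted macro-average of the per-task accuracies. A minimal sketch of that computation, using the task names from the first row's suite; the per-task scores here are hypothetical placeholders, not reported results:

```python
# Macro-average accuracy over a zero-shot evaluation suite.
# Task names mirror the BoolQ/PIQA/HellaSwag/... suite from the table;
# the individual scores are made-up placeholders for illustration.
scores = {
    "BoolQ": 72.0,
    "PIQA": 78.5,
    "HellaSwag": 60.2,
    "WinoGrande": 68.0,
    "ARC-e": 74.1,
    "ARC-c": 46.3,
    "OBQA": 41.0,
}

# Unweighted mean: every task counts equally, regardless of dataset size.
average = sum(scores.values()) / len(scores)
print(round(average, 2))  # → 62.87
```

Because the mean is unweighted, a small task such as OBQA moves the suite average exactly as much as a large one such as HellaSwag, which is worth keeping in mind when comparing averages across suites with different task mixes.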