| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Question Answering | Zero-shot Evaluation Suite (ARC, HellaSwag, MMLU) (test) | ARC-C | 52.9 | 67 |
| Classification | Zero-shot Evaluation Suite (BoolQ, PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, OBQA) | Average Accuracy | 69.31 | 59 |
| Image Classification | Zero-shot Evaluation Suite (Food-101, CIFAR-10, CIFAR-100, SUN397, Stanford Cars, FGVC Aircraft, DTD, Oxford-IIIT Pets, Caltech-101, Flowers102, ImageNet-1K), various (test) | Food-101 Top-1 Accuracy | 90.5 | 29 |
| Zero-shot Language Understanding | Zero-shot Evaluation Suite (LAMBADA, HellaSwag, PIQA, ARC-E, ARC-C, WinoGrande, OpenBookQA, MMLU) | ARC-E Accuracy | 83.4 | 25 |
| Zero-shot Reasoning | Zero-shot Evaluation Suite (OpenBookQA, ARC-e, ARC-c, WinoGrande, HellaSwag, PIQA, MathQA) | Average Accuracy | 54.92 | 24 |
| Zero-shot Question Answering and Reasoning | Zero-shot Evaluation Suite (ARC, LogiQA, WinoGrande, CommonsenseQA, BoolQ, PIQA, MMLU) | ARC | 83.88 | 21 |
| General Language Understanding | 12-task evaluation suite (test) | Average Score | 71.62 | 20 |
| Question Answering and Commonsense Reasoning | Zero-shot Evaluation Suite (ARC-e, ARC-c, BoolQ, HellaSwag, OpenBookQA, PIQA, SciQ, WinoGrande) | ARC-e | 65.74 | 18 |
| Large Language Model Evaluation | 12-task evaluation suite, composite (test) | Reading Comprehension Score | 49.6 | 14 |
| Zero-shot Evaluation | Zero-shot Evaluation Suite | ARC-C | 60.5 | 14 |
| Natural Language Understanding | Zero-shot Evaluation Suite (ARC-c, ARC-e, BoolQ, COPA, MMLU, OBQA, PIQA, RTE, WinoGrande) | ARC-c | 56.14 | 12 |
| Zero-shot Classification | Zero-shot Evaluation Suite (AC, AE, WI, QA) v1 | AC Score | 46.2 | 10 |
| Multimodal Understanding | Evaluation Suite, combined (held-out) | Accuracy | 48.4 | 4 |
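
The "Average Accuracy" entries above are unweighted means over the per-task zero-shot accuracies of a suite. Below is a minimal sketch of how such a suite is typically scored, assuming EleutherAI's lm-evaluation-harness (the table does not name an evaluation harness) and a placeholder checkpoint; the task list mirrors the Question Answering and Commonsense Reasoning row.

```python
# Minimal sketch: scoring a zero-shot suite with EleutherAI's
# lm-evaluation-harness (pip install lm-eval). Assumptions: the table
# does not specify a harness, and the checkpoint below is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-1.4b",  # placeholder checkpoint
    tasks=["arc_easy", "arc_challenge", "boolq", "hellaswag",
           "openbookqa", "piqa", "sciq", "winogrande"],
    num_fewshot=0,                                   # zero-shot, as in the table
)

# Each task reports accuracy under the "acc,none" key; a suite-level
# "Average Accuracy" is the unweighted mean over the tasks.
accs = [m["acc,none"] for m in results["results"].values()]
print(f"Average Accuracy (zero-shot suite): {100 * sum(accs) / len(accs):.2f}")
```

Note that length-normalized accuracy (`acc_norm`) is often reported instead for multiple-choice tasks such as ARC and HellaSwag; the table does not specify which variant each row uses.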