| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| General Language Understanding | GPT-3 Evaluation Suite (11 tasks: HellaSwag, LAMBADA, TriviaQA, WebQs, Winogrande, PIQA, ARC Challenge, ARC Easy, ANLI R1, ANLI R2, ANLI R3), zero-shot | Average Accuracy: 33.6 | 4 |
| Few-shot Language Understanding | GPT-3 Evaluation Suite, few-shot | Accuracy: 48.1 | 3 |
| Zero-shot Evaluation | GPT-3 Evaluation Suite (LAMBADA, TriviaQA, WebQs, PIQA, RACE-h, BoolQ), 1.3B, various (test/val) | Overall Accuracy: 44.4 | 3 |