| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Zero-shot Evaluation | Downstream Tasks Zero-shot | Accuracy 76 | 278 |
| Zero-shot Evaluation | Downstream Tasks MMLU, PIQA, Arc-E, Arc-C, Wino, OpenQA | MMLU 77.62 | 218 |
| Zero-shot Classification | Downstream Tasks Zero-shot (BoolQ, HellaSwag, WinoGrande, ARC-e, ARC-c, PIQA, OBQA) | BoolQ Accuracy 81.65 | 87 |
| Zero-shot Evaluation | 6 zero-shot downstream tasks | Average Accuracy 80.05 | 70 |
| Zero-shot Question Answering and Commonsense Reasoning | Zero-shot Downstream Tasks (ARC, HellaSwag, WinoGrande, BoolQ, PiQA) | Average Accuracy (Zero-Shot) 81 | 48 |
| Downstream Task Accuracy via Paradigm Routing | Downstream Tasks (test) | Accuracy 73.4 | 36 |
| Mean Performance Evaluation | Downstream Tasks Summary | Average Accuracy 61.5 | 36 |
| Downstream Task | 11 Downstream Tasks Aggregate | Average Accuracy 64.6 | 32 |
| Zero-shot classification | Eight downstream tasks zero-shot | Accuracy (Zero-shot) 47.7 | 30 |
| Zero-shot evaluation | Downstream Tasks PiQA ARC Hellaswag Winogrande BoolQ | PiQA Accuracy (Zero-shot) 84.4 | 30 |
| Zero-shot Learning | 7 Downstream Tasks Avg | Average Score 74.1 | 28 |
| Language Modeling Downstream Evaluation | Downstream tasks Average (test) | Average Score 58.02 | 24 |
| Zero-shot Reasoning | Downstream Tasks (LMB, PIQA, HellaSwag, OPQA, ARC) | LAMBADA (LMB) Accuracy 32.75 | 22 |
| Diverse Language Understanding | 62 downstream tasks | Average Accuracy 67.5 | 18 |
| Zero-shot Evaluation | Downstream tasks average | Avg Zero-shot Accuracy 81 | 16 |
| Multiple Choice Question Answering | 6 downstream tasks (ARC-Challenge, ARC-Easy, HellaSwag, Winogrande, SciQ, PIQA) | ARC-Challenge Accuracy 43.6 | 12 |
| Utility Evaluation | Downstream Tasks | Average Accuracy 63.4 | 12 |
| Downstream Task Evaluation | Downstream Tasks Aggregate | Accuracy 60.49 | 11 |
| Question Answering | Downstream Tasks (PiQA, ARC-E, ARC-C, HellaSwag, Winogrande, BoolQ, OBQA, SiQA) Zero-shot | PiQA Accuracy 80.63 | 10 |
| Zero-shot Task Evaluation | 9 Downstream Tasks Utility | Average Accuracy 54.7 | 10 |
| Downstream Task Evaluation | 10 Downstream Tasks | Average Accuracy 52.79 | 9 |
| Reasoning | Downstream Tasks PiQA, Arc E, Arc C, HS, WG, BoolQ | PiQA Accuracy 80.5 | 9 |
| Zero-shot Reasoning and Question Answering | Downstream Tasks (ARC-C, ARC-E, H'SWAG, LAM'DA, OPENBKQA, PIQA, W'GRANDE) zero-shot | ARC-C Score 32.76 | 9 |
| Multiple Choice Evaluation | Downstream Tasks (ARC, HellaSwag, PIQA, Winogrande) Standard LLM Eval (val/test) | Average Accuracy 56.98 | 9 |
| Multiple Choice Question Answering | Downstream Tasks (ARC-E, ARC-C, SciQ, PIQA, MMLU, CMMLU, CEVAL) | Average Accuracy 59.21 | 9 |