| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Zero-shot Evaluation | Downstream Tasks Zero-shot | Accuracy: 76 | 278 |
| Zero-shot Evaluation | Downstream Tasks (MMLU, PIQA, Arc-E, Arc-C, Wino, OpenQA) | MMLU: 77.62 | 218 |
| Zero-shot Classification | Downstream Tasks Zero-shot (BoolQ, HellaSwag, WinoGrande, ARC-e, ARC-c, PIQA, OBQA) | BoolQ Accuracy: 81.65 | 87 |
| Zero-shot Question Answering and Commonsense Reasoning | Zero-shot Downstream Tasks (ARC, HellaSwag, WinoGrande, BoolQ, PiQA) | Accuracy (ARC-E): 86 | 45 |
| Downstream Task Accuracy via Paradigm Routing | Downstream Tasks (test) | Accuracy: 73.4 | 36 |
| Mean Performance Evaluation | Downstream Tasks Summary | Average Accuracy: 61.5 | 36 |
| Downstream Task | 11 Downstream Tasks (aggregate) | Average Accuracy: 64.6 | 32 |
| Zero-shot Evaluation | 6 zero-shot downstream tasks | Average Accuracy: 80.05 | 31 |
| Zero-shot Evaluation | Downstream Tasks (PiQA, ARC, HellaSwag, Winogrande, BoolQ) | PiQA Accuracy (zero-shot): 84.4 | 30 |
| Zero-shot Learning | 7 Downstream Tasks (average) | Average Score: 74.1 | 28 |
| Language Modeling Downstream Evaluation | Downstream Tasks Average (test) | Average Score: 58.02 | 24 |
| Diverse Language Understanding | 62 downstream tasks | Average Accuracy: 67.5 | 18 |
| Zero-shot Evaluation | Downstream Tasks Average | Average Zero-shot Accuracy: 81 | 16 |
| Multiple Choice Question Answering | 6 downstream tasks (ARC-Challenge, ARC-Easy, HellaSwag, Winogrande, SciQ, PIQA) | ARC-Challenge Accuracy: 43.6 | 12 |
| Utility Evaluation | Downstream Tasks | Average Accuracy: 63.4 | 12 |
| Question Answering | Downstream Tasks Zero-shot (PiQA, ARC-E, ARC-C, HellaSwag, Winogrande, BoolQ, OBQA, SiQA) | PiQA Accuracy: 80.63 | 10 |
| Zero-shot Task Evaluation | 9 Downstream Tasks (utility) | Average Accuracy: 54.7 | 10 |
| Reasoning | Downstream Tasks (PiQA, Arc-E, Arc-C, HellaSwag, WinoGrande, BoolQ) | PiQA Accuracy: 80.5 | 9 |
| Zero-shot Reasoning and Question Answering | Downstream Tasks Zero-shot (ARC-C, ARC-E, HellaSwag, LAMBADA, OpenBookQA, PIQA, WinoGrande) | ARC-C Score: 32.76 | 9 |
| Multiple Choice Evaluation | Downstream Tasks (ARC, HellaSwag, PIQA, Winogrande), Standard LLM Eval (val/test) | Average Accuracy: 56.98 | 9 |
| Downstream Task Evaluation | 15 Downstream Tasks (summary) | Median EG: 2 | 7 |
| Class-conditioned Generation | Average of 8 downstream tasks | FID: 10.63 | 7 |
| Predictive Validity Verification | Downstream Tasks (KOLD, HateXplain, etc.) | Average Correlation: 0.3156 | 6 |
| Downstream Evaluation | 9 downstream tasks | Average Accuracy: 62.6 | 6 |
| Zero-shot Downstream Task Evaluation | Downstream Tasks Zero-shot (Arc-c, Arc-e, BoolQ, COPA, MMLU, OBQA, PIQA, RTE, Winogrande) | Arc-c: 49.32 | 6 |