| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Zero-shot Evaluation | Downstream Tasks Zero-shot | Accuracy: 76 | 278 |
| Zero-shot Evaluation | Downstream Tasks (MMLU, PIQA, Arc-E, Arc-C, Wino, OpenQA) | MMLU: 77.62 | 218 |
| Downstream Task | 11 Downstream Tasks Aggregate | Average Accuracy: 64.6 | 32 |
| Language Modeling Downstream Evaluation | Downstream Tasks Average (test) | Average Score: 58.02 | 24 |
| Zero-shot Evaluation | 6 Zero-shot Downstream Tasks | Average Accuracy: 80.05 | 19 |
| Diverse Language Understanding | 62 Downstream Tasks | Average Accuracy: 67.5 | 18 |
| Utility Evaluation | Downstream Tasks | Average Accuracy: 63.4 | 12 |
| Zero-shot Task Evaluation | 9 Downstream Tasks Utility | Average Accuracy: 54.7 | 10 |
| Multiple Choice Evaluation | Downstream Tasks (ARC, HellaSwag, PIQA, Winogrande), Standard LLM Eval (val/test) | Average Accuracy: 56.98 | 9 |
| Downstream Task Evaluation | 15 Downstream Tasks Summary | Median EG: 2 | 7 |
| Class-Conditioned Generation | Average of 8 Downstream Tasks | FID: 10.63 | 7 |
| Zero-shot Downstream Task Evaluation | Downstream Tasks, Zero-shot (Arc-c, Arc-e, BoolQ, COPA, MMLU, OBQA, PIQA, RTE, Winogrande) | Arc-c: 49.32 | 6 |
| Data-Incremental Learning | Downstream Tasks (test) | Accuracy: 62.35 | 6 |
| Class-Incremental Learning | Downstream Tasks (test) | Accuracy: 62.53 | 6 |
| Zero-shot Evaluation | Zero-shot Downstream Tasks (Arc-e, PIQA, Hellaswag, OpenBookQA, Winogrande, MMLU, BoolQ), Llama-1B Benchmark Suite (test) | Arc-e Accuracy: 31.63 | 5 |
| Zero-shot Learning | 7 Downstream Tasks Avg | Average Score: 64.53 | 4 |
| Downstream Task Evaluation | Downstream Tasks Aggregate | Accuracy: 54.37 | 3 |
| Multiple Choice Question Answering | Downstream Tasks (ARC-E, ARC-C, SciQ, PIQA, MMLU, CMMLU, CEVAL) | ARC-E Accuracy: 69.7 | 2 |