| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Zero-shot Language Modeling | LM Evaluation Harness, 0-shot | WG: 80.66 | 30 |
| Multi-task Language Understanding | LM Evaluation Harness (test) | ARC Challenge Acc: 44.28 | 24 |
| Language Modeling | LM Evaluation Harness (LM Eval, test) | WG (Winograd Schema): 74.11 | 22 |
| Natural Language Understanding | LM Evaluation Harness | MMLU (CoT): 72.76 | 19 |
| Downstream Task Evaluation | LM Evaluation Harness (MMLU, ARC-Challenge, HellaSwag, TruthfulQA, Winogrande, GSM8K), standard | MMLU: 65.8 | 16 |
| World Knowledge and Reading Comprehension | LM Evaluation Harness (NQ, MMLU STEM, ARC, SciQ, LogiQA, BoolQ) | NQ Accuracy: 29.81 | 15 |
| Zero-shot Evaluation | lm-evaluation-harness (SciQ, ARC-E, ARC-C, LogiQA, OBQA, BoolQ, HellaSwag, PIQA, WinoGrande), zero-shot | SciQ Accuracy: 68.2 | 13 |
| Zero-shot Natural Language Understanding | LM-Evaluation-Harness (ARC, BoolQ, HellaSwag, LAMBADA, PIQA, RACE, SciQ, ReCoRD, OBQA) | ARC Challenge: 46.8 | 13 |
| Language Understanding and Reasoning | LM-Evaluation-Harness (ARC-c, ARC-e, BoolQ, HellaSwag, MMLU, OBQA, PIQA, WG) | ARC-c Accuracy: 58.4 | 12 |
| Language Model Evaluation Suite | LM Evaluation Harness | Avg Accuracy: 66.6 | 8 |
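
Every row above is evaluated with EleutherAI's lm-evaluation-harness. As context for how such numbers are typically produced, below is a minimal sketch of a zero-shot run over a few of the listed tasks using the harness's Python API (`lm_eval.simple_evaluate`, available in v0.4+). The checkpoint `EleutherAI/pythia-160m` is only a small placeholder for illustration, not one of the models behind the results in the table.

```python
# Minimal sketch of a zero-shot lm-evaluation-harness run.
# Assumes `pip install lm_eval` (v0.4+). The checkpoint below is a
# placeholder; substitute the model you want to evaluate.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # HuggingFace-backed loader
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["winogrande", "arc_challenge", "sciq", "boolq"],
    num_fewshot=0,                                 # zero-shot, as in several rows above
    batch_size=8,
)

# Per-task metrics (e.g. acc, acc_norm), keyed by task name.
for task, metrics in results["results"].items():
    print(task, metrics)
```

The same run can be launched from the command line with the `lm_eval` entry point; reported averages such as the "Avg Accuracy" row are typically computed over the per-task `acc` values returned in `results["results"]`.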