| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Language Model Downstream Evaluation | LM Evaluation Harness zero-shot and five-shot | HellaSwag Acc 72.52 | 30 |
| Zero-shot Language Modeling | LM Evaluation Harness 0-shot | WG 80.66 | 30 |
| Multi-task Language Understanding | LM Evaluation Harness (test) | ARC Challenge Acc 44.28 | 24 |
| Language Modeling | LM Evaluation Harness (LM Eval) (test) | WG (Winograd Schema) 74.11 | 22 |
| Downstream task evaluation | LM Evaluation Harness MMLU, ARC, HellaSwag, TruthfulQA, Winogrande, CommonsenseQA | MMLU 70.5 | 19 |
| Natural Language Understanding | LM Evaluation Harness | MMLU (CoT) 72.76 | 19 |
| Zero-shot Evaluation | lm-evaluation-harness (SciQ, ARC-E, ARC-C, LogiQA, OBQA, BoolQ, HellaSwag, PIQA, WinoGrande) zero-shot | SciQ Accuracy 69.4 | 19 |
| Zero-shot Language Understanding | LM Evaluation Harness Downstream Suite (HellaSwag, PIQA, WinoGrande, OpenBookQA, SIQA, BoolQ, TriviaQA, MMLU, ARC-Challenge, ARC-Easy, MathQA, SciQ) | HellaSwag Accuracy 72.52 | 18 |
| Zero-shot Evaluation | LM Evaluation Harness PIQA, HellaSwag, COPA, RTE, OpenBookQA, LAMBADA-OpenAI | Average Score 75.97 | 16 |
| Language Modeling Evaluation | LM Evaluation Harness | Accuracy 60.35 | 16 |
| Downstream Task Evaluation | LM Evaluation Harness MMLU, ARC-Challenge, HellaSwag, TruthfulQA, Winogrande, GSM8K standard | MMLU 65.8 | 16 |
| World Knowledge and Reading Comprehension | LM Evaluation Harness NQ, MMLU STEM, ARC, SciQ, LogiQA, BoolQ | NQ Accuracy 29.81 | 15 |
| Commonsense Reasoning | LM-Evaluation-Harness Commonsense Reasoning: LAMBADA, WikiText, ARC, HellaSwag, PIQA, WinoGrande, BoolQ, SciQ | LAMBADA Perplexity (PPL) 11.86 | 13 |
| Zero-shot Natural Language Understanding | LM-Evaluation-Harness ARC, BoolQ, HellaSwag, LAMBADA, PIQA, RACE, SciQ, Record, OBQA | ARC Challenge 46.8 | 13 |
| Natural Language Understanding | lm-evaluation-harness suite (HellaSwag, RACE, PIQA, WinoGrande, ARC-e, ARC-c, OBQA) | HellaSwag 57.18 | 12 |
| Language Understanding and Reasoning | LM-Evaluation-Harness ARC-c, ARC-e, BoolQ, HellaS., MMLU, OBQA, PIQA, WG | ARC-c Accuracy 58.4 | 12 |
| Language Model Evaluation Suite | LM Evaluation Harness | Avg Accuracy 66.6 | 8 |
| Natural Language Understanding | LM Evaluation Harness | WG Score 57.2 | 5 |
| Downstream Evaluation | lm-evaluation-harness ARC-E, BoolQ, HellaSwag, OBQA, SciQ | ARC-E Accuracy 0.381 | 4 |
| Zero-shot downstream evaluation | LM Evaluation Harness 0-shot v1.0.0 | HellaSwag Accuracy 50.1 | 4 |