| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Zero-shot performance evaluation | LM Eval Harness (HellaSwag, BoolQ, WinoGrande, PIQA, ARC-Easy, ARC-Challenge), zero-shot | Mean Accuracy: 75.46 | 60 |
| Zero-shot Question Answering and Reasoning | LM-Eval-Harness suite (PIQA, HellaSwag, LAMBADA, ARC-E, ARC-C, SciQ, RACE, MMLU), zero-shot | PIQA: 80.7 | 32 |
| Question Answering and Commonsense Reasoning | lm-eval-harness (PIQA, COPA, OpenBookQA, WinoGrande, SciQA, ARC-E, ARC-C) | PIQA Accuracy: 78.8 | 10 |
| General Language Model Reasoning | LM-Eval-Harness (Hungarian) | ARC (hu) Accuracy: 38.6 | 4 |
| Language Modeling Utility | LM Eval Harness | HellaSwag Accuracy: 0.48 | 3 |
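The "Mean Accuracy" figure in the first row is an average over the six listed zero-shot tasks. A minimal sketch of that aggregation is below; the per-task accuracies are hypothetical placeholders, not the actual SOTA scores:

```python
from statistics import mean

# Hypothetical per-task zero-shot accuracies (placeholders, not the SOTA numbers).
task_accuracy = {
    "hellaswag": 0.79,
    "boolq": 0.83,
    "winogrande": 0.74,
    "piqa": 0.81,
    "arc_easy": 0.80,
    "arc_challenge": 0.56,
}

# Unweighted mean across tasks, reported as a percentage.
mean_accuracy = mean(task_accuracy.values()) * 100
print(f"Mean Accuracy: {mean_accuracy:.2f}")  # prints "Mean Accuracy: 75.50"
```

Leaderboards typically report this as an unweighted mean, so tasks with fewer examples (e.g. ARC-Challenge) count the same as larger ones.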