| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Zero-shot downstream task evaluation | LM-EVAL (average of HellaSwag, PIQA, ARC-Easy, ARC-Challenge, and WinoGrande), zero-shot, latest | Average Accuracy: 76 | 30 |
| Question Answering and Commonsense Reasoning | LM Eval (ARC-C, ARC-E, HellaSwag, PIQA), v0.4.4 standard (test) | ARC-C: 61.6 | 18 |