| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Question Answering and Commonsense Reasoning | NLP Benchmark Suite Zero-shot (HellaSwag, RACE, PIQA, WinoGrande, ARC, OBQA) (test) | HellaSwag Accuracy: 63.36 | 28 |
| Language Modeling | NLP Benchmark Suite Aggregate | Average Delta: -9.2 | 16 |
| Aggregate NLP Evaluation | NLP Benchmark Suite Average | Average Accuracy: 64 | 9 |