| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Zero-shot Reasoning | Reasoning Suite Zero-shot (PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c) (val test) | PIQA 81.77 | 119 | |
| Commonsense Reasoning | Reasoning Suite Zero-shot Aggregate | Aggregate Score 73.2 | 45 | |
| Reasoning | Reasoning Suite Average | Accuracy 72.8 | 36 | |
| Zero-shot Accuracy | Reasoning Suite Zero-shot (PIQA, Hella Swag, LAMBADA, ARC-e, ARC-c, SciQ, Race, MMLU) | PIQA 80.7 | 21 | |
| Reasoning | Reasoning Suite | GSM8K 85.12 | 9 | |
| Zero-shot Reasoning | Reasoning Suite (ARC-e, ARC-c, HellaSwag, PIQA, Winogrande) zero-shot | ARC-e Accuracy 0.7559 | 8 | |
| Reasoning | Reasoning Suite BBH, GPQA, MuSR | BBH 83.4 | 7 | |
| Logical and Commonsense Reasoning | Reasoning Suite | BIG-Bench Hard 89.36 | 4 | |