| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Zero-shot Reasoning | Reasoning Suite Zero-shot (PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c) (val test) | Average Accuracy 76.55 | 297 | |
| Zero-shot Reasoning | Reasoning Suite (ARC-e, ARC-c, HellaSwag, PIQA, Winogrande) zero-shot | Average Reasoning Score 6,540 | 107 | |
| Commonsense Reasoning | Reasoning Suite Zero-shot Aggregate | Aggregate Score 73.2 | 50 | |
| Reasoning | Reasoning Suite Average | Accuracy 74.8 | 45 | |
| Reasoning and Language Modeling | Reasoning Suite (ARC, HellaSwag, PIQA, WinoGrande, MMLU, OpenBookQA, Real-world QA) Zero-shot Llama-3.1-8B-Instruct with Alpaca calibration | PPL 9.63 | 32 | |
| Zero-shot Language Understanding | Reasoning Suite Zero-shot (BoolQ, WinoG., PIQA, OBQA, HellaS., ARC-e, ARC-c) | BoolQ Accuracy 82.63 | 24 | |
| Zero-shot Accuracy | Reasoning Suite Zero-shot (PIQA, Hella Swag, LAMBADA, ARC-e, ARC-c, SciQ, Race, MMLU) | PIQA 80.7 | 21 | |
| Zero-shot Reasoning | Reasoning Suite PiQA, LAMBDA, ARC, HellaSwag | PiQA Score 62.69 | 20 | |
| Question Answering | Reasoning Suite Zero-shot (ArcC, ArcE, PiQA, Wino) | Arc Challenge (C) Accuracy 50.43 | 16 | |
| Zero-shot Learning | Reasoning Suite Zero-shot (ARC-e, ARC-c, WG, BQ, PIQA, HS, OBQA, HQA) | ARC-e Accuracy 49.7 | 9 | |
| Zero-shot Reasoning | Reasoning Suite Zero-shot (PIQA, ARC, HS, WG, BoolQ, MMLU) | PIQA Accuracy 80.2 | 9 | |
| Zero-shot Commonsense Reasoning | Reasoning Suite Zero-shot (ARC-E, BoolQ, HSwag, LAMBADA, OBQA, PIQA, SocIQA, WinoGr.) | ARC-E Accuracy 45.5 | 9 | |
| Reasoning | Reasoning Suite | GSM8K 85.12 | 9 | |
| Reasoning | Reasoning Suite BBH, GPQA, MuSR | BBH 83.4 | 7 | |
| Reasoning | Reasoning Suite (MMLU-Pro, GPQA Diamond, AIME-24, AIME-25) zero-shot | MMLU-Pro Accuracy 74.8 | 6 | |
| Zero-shot Reasoning | Reasoning Suite Zero-shot (PIQA, ARCe, ARCc, BoolQ, Hella., Wino.) | PIQA Accuracy 79.05 | 4 | |
| Logical and Commonsense Reasoning | Reasoning Suite | BIG-Bench Hard 89.36 | 4 | |