| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Language Understanding and Reasoning | LLM Evaluation Suite (ARC-e, ARC-c, HellaSwag, OBQA, WinoGrande, MathQA, PIQA) | Average Accuracy: 64.01 | 19 |
| General Language Understanding and Reasoning | LLM Evaluation Suite (ARC, CSQA, GSM8K, HS, MMLU, OBQA, PIQA, SIQA, TQA, WG) | ARC: 45.9 | 14 |
| Zero-shot Language Understanding and Reasoning | LLM Evaluation Suite (MMLU, ARC-C, PIQA, WinoG, GSM8K, HellaSwag, GPQA, RACE), zero-shot, LLaDA 1.5 | Average Score: 58.59 | 13 |
| Language Modeling and Reasoning | LLM Evaluation Suite (ARC, BBH, HellaSwag, TruthfulQA, LAMBADA, WinoGrande, GSM8K, MT-Bench) | ARC (Accuracy): 54.61 | 3 |