| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Language Modeling | LLM Evaluation Suite | Accuracy: 83.2 | 53 |
| Language Understanding and Reasoning | LLM Evaluation Suite (MMLU, GSM8K, HellaSwag, WinoGrande) | MMLU Score: 64.43 | 31 |
| Language Understanding and Reasoning | LLM Evaluation Suite (ARC-e, ARC-c, HellaSwag, OBQA, WinoGrande, MathQA, PIQA) | Average Accuracy: 64.01 | 19 |
| General Language Understanding and Reasoning | LLM Evaluation Suite (ARC, CSQA, GSM8K, HS, MMLU, OBQA, PIQA, SIQA, TQA, WG) | ARC: 45.9 | 14 |
| Zero-shot Language Understanding and Reasoning | LLM Evaluation Suite (MMLU, ARC-C, PIQA, WinoGrande, GSM8K, HellaSwag, GPQA, RACE), zero-shot, LLaDA 1.5 | Average Score: 58.59 | 13 |
| Zero-shot Language Understanding and Reasoning | LLM Evaluation Suite (HellaSwag, MMLU, ARC-C, BoolQ, LAMBADA, ARC-E, HumanEval), zero-shot, Qwen3-30B-A3B | HellaSwag Accuracy: 79.8 | 12 |
| Model Merging | LLM Evaluation Suite | Normalized Score: 0.401 | 12 |
| Zero-shot Language Understanding | LLM Evaluation Suite (MMLU, GSM8K, HellaSwag, WinoGrande) | MMLU: 72.8 | 12 |
| Language Modeling and Reasoning | LLM Evaluation Suite (ARC, BBH, HellaSwag, TruthfulQA, LAMBADA, WinoGrande, GSM8K, MT-Bench) | ARC (Accuracy): 54.61 | 3 |