| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Language Modeling and Reasoning | Open LLM Leaderboard | ARC: 82.8 | 33 |
| Open-style response generation | Open LLM Leaderboard | Accuracy: 53.45 | 28 |
| Language Model Evaluation | Open LLM Leaderboard v2 (test) | BBH: 60.84 | 20 |
| Unified Multi-task Language Understanding and Instruction Following | Open LLM Leaderboard v1 (test) | MMLU-P Accuracy: 11.5 | 19 |
| Large Language Model Evaluation | Open LLM Leaderboard | Average Score: 74.2 | 19 |
| General Language Understanding and Reasoning | Open LLM Leaderboard Lighteval (test) | Mean Accuracy: 91.07 | 17 |
| Language Modeling | Open LLM Leaderboard & General Ability Benchmarks (MMLU-P, GPQA, BBH, MATH, MuSR, IFEval, ARC, HellaSwag, PIQA, BoolQ, WinoGrande, COPA, OpenBookQA, SciQ) unified (test) | MMLU-P Accuracy: 12 | 16 |
| Large Language Model Evaluation | Open LLM Leaderboard v1 (test) | Average Score: 69.6 | 14 |
| Language Modeling Evaluation | Open LLM Leaderboard | ARC: 70.22 | 14 |
| Natural Language Understanding | Open LLM Leaderboard (test) | ARC: 57.94 | 13 |
| General LLM Evaluation | Open LLM Leaderboard (test) | ARC-c: 78.92 | 12 |
| General Language Understanding | Open LLM Leaderboard (test) | ARC: 62.03 | 9 |
| Reasoning and Language Understanding | Open LLM Leaderboard MMLU-PRO, IFEval, BBH, GPQA, MATH, GSM8K, ARC v0.4.0 (test) | MMLU-PRO: 28.38 | 7 |
| General Language Understanding | Open LLM Leaderboard | Average Score: 65.51 | 7 |
| Downstream Language Understanding | Open LLM Leaderboard zero-shot | ARC-E: 52.9 | 6 |
| Unified Multi-task Language Understanding and Instruction Following | Open LLM Leaderboard | MMLU-P Accuracy: – | 0 |
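
Several rows above report an "Average Score". On the Open LLM Leaderboard (v1), this is the unweighted mean of the per-benchmark accuracies. A minimal sketch of that computation, assuming the v1 benchmark set; the scores below are hypothetical placeholders, not values from the table:

```python
# Hypothetical per-benchmark accuracies (%); the v1 leaderboard averaged
# ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K.
scores = {
    "ARC": 70.2,
    "HellaSwag": 85.1,
    "MMLU": 64.3,
    "TruthfulQA": 55.0,
    "Winogrande": 80.4,
    "GSM8K": 62.7,
}

# Unweighted mean across benchmarks, as reported in the "Average Score" cells.
average_score = sum(scores.values()) / len(scores)
print(f"Average Score: {average_score:.2f}")
```

Because the average is unweighted, a model's rank can shift noticeably when a single benchmark (e.g. GSM8K) moves, which is worth keeping in mind when comparing "Average Score" entries across leaderboard versions.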