| Dataset Name | SOTA Method | Metric | Value | Trend | Last Updated |
|---|---|---|---|---|---|
| MMLU | M2CL | Accuracy | 96.6 | 825 | 26d ago |
| Nine Zero-Shot Tasks (BoolQ, HellaSwag, LAMBADA, OpenBookQA, PIQA, SIQA, WinoGrande, ARC-Easy, ARC-Challenge) | | Average Accuracy | 73.81 | 173 | 9d ago |
| MMLU (test) | | MMLU Average Accuracy | 88 | 163 | 5d ago |
| MMLU 5-shot (test) | | Accuracy | 74.2 | 149 | 1mo ago |
| MMLU 5-shot | ERNIE 5.0-Base | Accuracy | 90.58 | 132 | 19d ago |
| MMLU 0-shot | Token Filtering | Accuracy | 70.46 | 110 | 1mo ago |
| MMLU | | MMLU Score | 73.02 | 98 | 1mo ago |
| MMLU-Pro | gpt-oss-120B | Accuracy | 80.6 | 87 | 26d ago |
| MMLU | Qwen3-14B | MMLU Accuracy | 87.56 | 77 | 2d ago |
| MMLU | gpt-oss-120b | MMLU Score | 88.6 | 70 | 12d ago |
| Polish Open Leaderboard | | Average Performance | 69.84 | 53 | 4d ago |
| MMLU | MASA | Average Accuracy | 71.91 | 50 | 1mo ago |
| MMLU o=1 Exact split | ITD | Accuracy | 77.6 | 42 | 1mo ago |
| CMMLU | Qwen2-72B | Accuracy | 90.1 | 42 | 29d ago |
| MMLU | GPT-4o-mini | Accuracy | 82.1 | 34 | 1mo ago |
| MMLU | Mixtral-8x22B | Humanities Avg | 68.6 | 33 | 1mo ago |
| MMLU | Verify-Only | Accuracy | 84.9 | 31 | 1mo ago |
| MMLU | CortexDebate | RA | 82.33 | 31 | 1mo ago |
| MMLU-M | ZipCal | Accuracy | 27.4 | 29 | 1mo ago |
| MMLU | | During-task Accuracy | 90.8 | 29 | 1mo ago |
| MMLU | Phi-4-14B | MMLU First-Token Accuracy | 79.7 | 24 | 12d ago |
| MMLU-Redux | | Base Score | 0.3762 | 24 | 1mo ago |
| C-Eval | Qwen2-57B-A14B | C-Eval Score | 87.7 | 24 | 1mo ago |
| Aggregate ARC-C, MMLU, HellaSwag, TruthfulQA (test) | IFD | Total Score | 159.215 | 22 | 1mo ago |
| MMLU | | MMLU Score (x100) | 65.81 | 21 | 1mo ago |