| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| MMLU | M2CL | Accuracy | 96.6 | 756 | 2d ago |
| MMLU 5-shot (test) | | Accuracy | 74.2 | 149 | 3d ago |
| MMLU (test) | | MMLU Average Accuracy | 88 | 136 | 3d ago |
| MMLU 5-shot | ERNIE 5.0-Base | Accuracy | 90.58 | 132 | 3d ago |
| MMLU 0-shot | Token Filtering | Accuracy | 70.46 | 110 | 3d ago |
| MMLU-Pro | gpt-oss-120B | Accuracy | 80.6 | 70 | 3d ago |
| MMLU | gpt-oss-120b | MMLU Score | 88.6 | 45 | 3d ago |
| MMLU o=1 Exact split | ITD | Accuracy | 77.6 | 42 | 3d ago |
| MMLU | Mixtral-8x22B | Humanities Avg | 68.6 | 33 | 3d ago |
| MMLU | Verify-Only | Accuracy | 84.9 | 31 | 3d ago |
| MMLU | CortexDebate | RA | 82.33 | 31 | 3d ago |
| MMLU | | During-task Accuracy | 90.8 | 29 | 3d ago |
| CMMLU | Qwen2-72B | Accuracy | 90.1 | 27 | 2d ago |
| MMLU | DEITA | MMLU Score | 65.43 | 24 | 3d ago |
| MMLU-Redux | | Base Score | 0.3762 | 24 | 3d ago |
| C-Eval | Qwen2-57B-A14B | C-Eval Score | 87.7 | 24 | 3d ago |
| Aggregate ARC-C, MMLU, HellaSwag, TruthfulQA (test) | IFD | Total Score | 159.215 | 22 | 3d ago |
| INCLUDE base 44 | Bielik-11B-v3-Instruct | Average Score | 64.8 | 21 | 3d ago |
| MMLU o=1 (Semantic-level) | | Accuracy | 76.9 | 21 | 3d ago |
| MMLU | | Medicine Accuracy | 80.7 | 17 | 3d ago |
| Multilingual MMLU internal translated version | | Accuracy | 85.5 | 16 | 3d ago |
| MMLU | | Accuracy | 81.31 | 15 | 3d ago |
| MMLU en | Qwen2-FFT | MMLU (en) | 70.23 | 15 | 3d ago |
| MMLU v1 (test) | | Accuracy | 72.4 | 15 | 3d ago |
| MMLU French (test) | Trinity Large (MoE) | Accuracy | 71 | 15 | 3d ago |