| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| MMLU | | MMLU General Knowledge Accuracy | 91.2 | 307 | 8d ago |
| MMLU-Pro | Qwen-3-30B-A3B | MMLU-Pro General Knowledge Score | 76.7 | 55 | 2d ago |
| MMLU (test) | SC-MAS | Accuracy | 87.6 | 53 | 3mo ago |
| CMMLU | | Accuracy | 89.5 | 50 | 19d ago |
| HellaSwag | JoyAI-LLM Flash | Accuracy | 91.7 | 36 | 1mo ago |
| MMLU-Pro | Qwen3.5-122B-A10B | Accuracy | 86.7 | 33 | 22d ago |
| MMLU | SPINE | pass@1 | 72.48 | 31 | 1mo ago |
| MMLU-redux | MiMo-V2-Flash Base | Accuracy | 90.6 | 30 | 3mo ago |
| MMLU | PRISM | General Score | 76.1 | 25 | 2d ago |
| MMLU | Qwen 3 VL 32B Think | Score | 90.1 | 25 | 3mo ago |
| MMLU-Pro | DeepSeek-R1 | MMLU-Pro General Knowledge EM | 84 | 22 | 1mo ago |
| MMLU-Pro | DLE (ε-sampling)-PROBFIRST | maj@4 Accuracy | 35.88 | 21 | 1mo ago |
| MMLU-Pro | Qwen3-8B | pass@1 | 66.46 | 20 | 1mo ago |
| SuperGPQA | | pass@1 | 48.2 | 19 | 3mo ago |
| CEval | LongCat-Flash Chat | Score | 90.4 | 19 | 1mo ago |
| Knowledge Benchmarks (ARC-C, ARC-E, MMLU, GPQA) (test) | Task Arithmetic | ARC-C | 83.05 | 18 | 3mo ago |
| GPQA Diamond | | Pass@1 | 65 | 17 | 1mo ago |
| GPQA | I-DPO + MaPPO | Accuracy | 33.3 | 15 | 21d ago |
| GPQA Diamond | GLM-4.7-Flash-T | Accuracy | 76.7 | 15 | 14d ago |
| Global MMLU Ukrainian (test) | Gemma 3 12B PT | Accuracy (%) | 67.03 | 14 | 22d ago |
| C-Eval 1.0 (val) | Qwen-1.5 14B | Accuracy | 78.68 | 12 | 3mo ago |
| MMMLU | Qwen3-14B | CLCall Score | 76.1 | 10 | 2mo ago |
| MMLU-Pro 0-shot | MARS-7B | MMLU-Pro Score (0-shot) | 44.4 | 9 | 1mo ago |
| MMLU, HellaSwag, TruthfulQA | Sens-Merging (Task Arithmetic) | MMLU | 55.88 | 9 | 3mo ago |
| MMLU EL | LayerMoE | MMLU EL (General Knowledge) Accuracy | 44.06 | 8 | 2mo ago |