| Dataset Name | SOTA Method | Metric | Value |
|---|---|---|---|
| MMLU | — | MMLU General Knowledge Accuracy | 91.2 |
| MMLU (test) | SC-MAS | Accuracy | 87.6 |
| MMLU-Pro | Qwen-3-30B-A3B | MMLU-Pro General Knowledge Score | 76.7 |
| MMLU-redux | MiMo-V2-Flash Base | Accuracy | 90.6 |
| HellaSwag | JoyAI-LLM Flash | Accuracy | 91.7 |
| CMMLU | — | Accuracy | 89.5 |
| MMLU | Qwen 3 VL 32B Think | Score | 90.1 |
| MMLU | SPINE | pass@1 | 72.48 |
| MMLU-Pro | DeepSeek-R1 | MMLU-Pro General Knowledge EM | 84 |
| SuperGPQA | — | pass@1 | 48.2 |
| Knowledge Benchmarks (ARC-C, ARC-E, MMLU, GPQA) (test) | Task Arithmetic | ARC-C | 83.05 |
| CEval | LongCat-Flash Chat | Score | 90.4 |
| C-Eval 1.0 (val) | Qwen-1.5 14B | Accuracy | 78.68 |
| MMLU-Pro | Qwen3.5-122B-A10B | Accuracy | 86.7 |
| MMLU-Pro | Qwen3-8B | pass@1 | 66.46 |
| MMMLU | Qwen3-14B | CLCall Score | 76.1 |
| MMLU-Pro (0-shot) | MARS-7B | MMLU-Pro Score (0-shot) | 44.4 |
| MMLU | — | MMLU Accuracy | 64 |
| MMLU, HellaSwag, TruthfulQA | Sens-Merging (Task Arithmetic) | MMLU | 55.88 |
| MMLU EL | LayerMoE | MMLU EL (General Knowledge) Accuracy | 44.06 |
| GPQA Diamond | — | pass@1 | 65 |
| MMLU | PRD | Avg Input Context Length (tokens) | 11,691.83 |
| General Knowledge Suite (MMLU-Pro, MMLU) | Qwen3-14B | MMLU Accuracy | 82.7 |
| Unified Korean Benchmark General Knowledge | Mi:dm 2.0 Base-inst | KMMLU | 57.3 |
| MMLU | KL | MMLU General Knowledge Drop in Utility (%) | 37.4 |