| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| MMLU | DeepSeek V3.2 | MMLU General Knowledge Accuracy | 91.1 | 170 | 4d ago |
| MMLU-Pro | Qwen-3-30B-A3B | MMLU-Pro General Knowledge Score | 76.7 | 38 | 4d ago |
| MMLU (test) | SC-MAS | Accuracy | 87.6 | 33 | 4d ago |
| MMLU | Qwen 3 VL 32B Think | Score | 90.1 | 25 | 4d ago |
| MMLU-redux | MiMo-V2-Flash Base | Accuracy | 90.6 | 20 | 4d ago |
| Knowledge Benchmarks (ARC-C, ARC-E, MMLU, GPQA) (test) | Task Arithmetic | ARC-C | 83.05 | 18 | 2d ago |
| MMLU | | pass@1 | 68.39 | 16 | 4d ago |
| CEval | LongCat-Flash Chat | Score | 90.4 | 13 | 4d ago |
| HellaSwag | Qwen3-1.7B | Accuracy | 59.4 | 13 | 4d ago |
| C-Eval 1.0 (val) | Qwen-1.5 14B | Accuracy | 78.68 | 12 | 4d ago |
| SuperGPQA | Qwen3-8B | pass@1 | 36.21 | 11 | 4d ago |
| MMLU-Pro | Qwen3-8B | pass@1 | 66.46 | 11 | 4d ago |
| CMMLU | | Accuracy | 88.4 | 9 | 4d ago |
| MMLU, HellaSwag, TruthfulQA | Sens-Merging (Task Arithmetic) | MMLU | 55.88 | 9 | 4d ago |
| MMLU | PRD | Avg Input Context Length (tokens) | 11,691.83 | 8 | 4d ago |
| General Knowledge Suite MMLU-pro, MMLU | Qwen3-14B | MMLU Accuracy | 82.7 | 7 | 4d ago |
| Unified Korean Benchmark General Knowledge | Mi:dm 2.0 Base-inst | KMMLU | 57.3 | 7 | 4d ago |
| MMLU | RLBF | Solution Rate | 70.7 | 6 | 4d ago |
| MMLU Redux | | Exact Match | 92.9 | 6 | 4d ago |
| MMLU | | EM | 91.8 | 6 | 4d ago |
| MMLU-Pro | DeepSeek-R1 | MMLU-Pro General Knowledge EM | 84 | 5 | 4d ago |
| GPQA full | RePO | pass@1 | 61.8 | 4 | 4d ago |
| SuperGPQA (test) | MENTORCOLLAB MLP | Accuracy | 18.6 | 4 | 4d ago |