| Dataset Name | SOTA Method | Metric | Trend | Updated | |
|---|---|---|---|---|---|
| MMLU | M2CL | MMLU Accuracy 95.1 | 126 | 3d ago | |
| StratQA | Process Supervision | Accuracy 87.8 | 91 | 3d ago | |
| BIG-Bench Hard | Qwen 3 VL 32B Think | Accuracy 91.1 | 68 | 2d ago | |
| MMLU-Pro | | MMLU-Pro General Reasoning Avg@8 Acc 90.1 | 51 | 3d ago | |
| MMLU-Pro | AFlow | Accuracy 82.3 | 48 | 3d ago | |
| BBH | Kimi-K2 Base | BBH General Reasoning Accuracy 88.7 | 43 | 3d ago | |
| Big-Bench Hard (BBH) (val) | TAIA | Accuracy 43.46 | 36 | 3d ago | |
| AGIEval | GPT-4 | Exact Match 70.4 | 33 | 3d ago | |
| MMStar | ViLoMem | Score 69.2 | 32 | 3d ago | |
| MMMU | ViLoMem | Overall Score 75.4 | 32 | 3d ago | |
| AGI Eval English | Qwen 3 VL 8B Think | Score 90.1 | 32 | 3d ago | |
| BIG-bench | SFT based | Accuracy @ t1 74.6 | 29 | 3d ago | |
| MMLU-Pro | Critique-GRPO (CoT Critique) | pass@1 Accuracy 70.47 | 27 | 3d ago | |
| General-R MMLU-stem, ARC-challenge (test) | | Accuracy 61.8 | 24 | 3d ago | |
| MMLU-P | | Accuracy 75.6 | 24 | 3d ago | |
| HLE | | Accuracy 38.4 | 21 | 3d ago | |
| BBH (Big-Bench-Hard) (test) | ICL+FT | Accuracy 81.8 | 20 | 2d ago | |
| BBEH | | Accuracy 78.8 | 19 | 3d ago | |
| General Reasoning Suite MMLU Pro, Super GPQA, GPQA Diamond, BBEH | General-Reasoner | MMLU Pro 65.1 | 19 | 3d ago | |
| MATH-500, GPQA-D, MMLU-P, GSM8K, ARC-C Aggregate | STAR-1-mix | Average Score 85.41 | 18 | 3d ago | |
| Global MMLU | | MMLU 36.1 | 16 | 3d ago | |
| GPQA Diamond | R-Zero | Pass@1 Accuracy 40.5 | 16 | 3d ago | |
| Super GPQA | Absolute Zero | pass@1 Acc 33.5 | 16 | 3d ago | |
| Average of Reasoning Tasks | PASER | Average Accuracy 63.31 | 15 | 3d ago | |
| MMLU | Denser | Accuracy 81.5 | 15 | 3d ago | |