| Dataset Name | SOTA Method | Metric | Value | Trend | |
|---|---|---|---|---|---|
| MMLU | M2CL | MMLU Accuracy | 95.1 | 156 | 11d ago |
| MMLU-Pro | AFlow | Accuracy | 82.3 | 114 | 11d ago |
| BBH | | BBH General Reasoning Accuracy | 94.6 | 98 | 10d ago |
| StratQA | Process Supervision | Accuracy | 87.8 | 91 | 1mo ago |
| Super GPQA | Gemini 2.5 Pro | Accuracy | 71.1 | 89 | 10d ago |
| MMLU-Pro | Critique-GRPO (CoT Critique) | pass@1 Accuracy | 70.47 | 69 | 1mo ago |
| BIG-Bench Hard | Qwen 3 VL 32B Think | Accuracy | 91.1 | 68 | 1mo ago |
| MMLU-Pro | | MMLU-Pro General Reasoning Avg@8 Acc | 90.1 | 63 | 8d ago |
| Out-of-Distribution Performance Suite (ARC-c, GPQA*, MMLU-Pro) (test) | On-Policy | ARC-c Score | 91.4 | 51 | 11d ago |
| GPQA Diamond | gemini-2.5-pro | Pass@1 Accuracy | 86.4 | 47 | 1mo ago |
| Overall | DTSR | Accuracy | 84.8 | 40 | 9d ago |
| MMLU-R | ROSA2 | Accuracy (MMLU-R General Reasoning) | 84.4 | 40 | 1mo ago |
| BBEH | | Accuracy | 78.8 | 39 | 24d ago |
| GPQA | Gemini 2.5 Pro | Accuracy | 86.4 | 36 | 10d ago |
| BIG-bench | POES | Accuracy (General) | 81.6 | 36 | 4d ago |
| Big-Bench Hard (BBH) (val) | TAIA | Accuracy | 43.46 | 36 | 1mo ago |
| General Reasoning Suite (MMLU Pro, Super GPQA, GPQA Diamond, BBEH) | | MMLU Pro | 84 | 35 | 1mo ago |
| GPQA diamond | ROSA (+LM + M) | Avg@8 Accuracy | 75.18 | 34 | 1mo ago |
| AGIEval | GPT-4 | Exact Match | 70.4 | 33 | 1mo ago |
| MMStar | ViLoMem | Score | 69.2 | 32 | 1mo ago |
| MMMU | ViLoMem | Overall Score | 75.4 | 32 | 1mo ago |
| AGI Eval English | Qwen 3 VL 8B Think | Score | 90.1 | 32 | 1mo ago |
| GPQA | EVOL-RL | pass@1 | 45.2 | 26 | 26d ago |
| MMLU | Denser | Accuracy | 81.5 | 25 | 1mo ago |
| Overall (MATH-500, AIME25, HumanEval, GPQA) | | Accuracy | 85.1 | 24 | 1mo ago |