| Dataset Name | SOTA Method | Metric | Score | Trend | Updated |
|---|---|---|---|---|---|
| MMLU-Pro | AFlow | Accuracy | 82.3 | 201 | 5d ago |
| BBH | SABA | Accuracy | 93.2 | 190 | 6d ago |
| MMLU | M2CL | MMLU Accuracy | 95.1 | 180 | 7d ago |
| BBH | | BBH General Reasoning Accuracy | 94.6 | 103 | 6d ago |
| Super GPQA | Gemini 2.5 Pro | Accuracy | 71.1 | 99 | 23d ago |
| MMLU-Pro | | pass@1 Accuracy | 73.44 | 93 | 22d ago |
| StratQA | Process Supervision | Accuracy | 87.8 | 91 | 3mo ago |
| BIG-Bench Hard | Qwen 3 VL 32B Think | Accuracy | 91.1 | 68 | 8d ago |
| Out-of-Distribution Performance Suite (ARC-c, GPQA*, MMLU-Pro) (test) | On-Policy | ARC-c Score | 91.4 | 66 | 2d ago |
| BBEH | | Accuracy | 78.8 | 64 | 15d ago |
| General Reasoning Suite Average | | Pass@1 | 78.3 | 63 | 12d ago |
| MMLU-Pro | | MMLU-Pro General Reasoning Avg@8 Acc | 90.1 | 63 | 1mo ago |
| GPQA | Gemini 2.5 Pro | Accuracy | 86.4 | 59 | 6d ago |
| MMLU | Pivot-SFT | Accuracy | 86.21 | 51 | 9d ago |
| LiveBench | PerSyn | Accuracy | 53.47 | 50 | 14d ago |
| MMMU | ViLoMem | Overall Score | 75.4 | 48 | 4d ago |
| GPQA Diamond | gemini-2.5-pro | Pass@1 Accuracy | 86.4 | 47 | 2mo ago |
| Overall | DTSR | Accuracy | 84.8 | 40 | 1mo ago |
| MMLU-R | ROSA2 | Accuracy (MMLU-R General Reasoning) | 84.4 | 40 | 2d ago |
| GPQA | SRGen | pass@1 | 65.7 | 38 | 2d ago |
| BIG-bench | POES | Accuracy (General) | 81.6 | 36 | 1mo ago |
| Big-Bench Hard (BBH) (val) | TAIA | Accuracy | 43.46 | 36 | 3mo ago |
| GPQA-Diamond & MMLU-Pro | Scaf-GRPO | Accuracy | 53.6 | 35 | 21d ago |
| ARC-C | MAE | Accuracy | 98 | 35 | 6d ago |
| General Reasoning Suite MMLU Pro, Super GPQA, GPQA Diamond, BBEH | | MMLU Pro | 84 | 35 | 2mo ago |