| Dataset Name | SOTA Method | Metric | Trend | Updated | |
|---|---|---|---|---|---|
| LogicVista | CodePercept-32B-S1 | Accuracy 70.02 | 23 | 1mo ago | |
| MathVision | CodePercept-32B-S1 | Accuracy 69.96 | 23 | 1mo ago | |
| GPQA Diamond | SwiR | Accuracy 70.2 | 16 | 1mo ago | |
| TheoremQA | RLPR | Avg@2 55.4 | 16 | 1mo ago | |
| Super GPQA | MTP-D | Speedup Ratio 2.096 | 15 | 23d ago | |
| GPQA-Diamond, PHYBench, BIOBench | | Pass@1 91.9 | 15 | 25d ago | |
| GPQA Diamond | PAPO | Accuracy (avg@4) 55 | 12 | 20d ago | |
| GPQA Diamond | SCVC | AUROC 82.1 | 10 | 29d ago | |
| GPQA-Diamond 5-shot | ERNIE 5.0-Base | Accuracy 57.3 | 10 | 1mo ago | |
| TheoremQA | Bingo-A | Accuracy 36.8 | 8 | 11d ago | |
| Minerva | Training-time reweighting | Avg@32 56.57 | 8 | 25d ago | |
| GPQA | Qwen3-8B | Score 60.9 | 7 | 1mo ago | |
| AIME 2025 | Qwen3-8B | Score 67.6 | 7 | 1mo ago | |
| AIME 2024 | Qwen3-8B-as-GenRM | Score 77.7 | 7 | 1mo ago | |
| Minerva 272 undergraduate-level STEM problems | Ours | Avg@32 Score 56.57 | 3 | 25d ago | |