| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| SciWorld | M2CL | Accuracy | 95.9 | 164 | 3mo ago |
| ARC Challenge | Qwen-3-32B | Accuracy | 92.5 | 115 | 1d ago |
| GPQA Main | Llama 70B | Accuracy | 43.08 | 101 | 1d ago |
| GPQA Diamond (test) | HEART | Accuracy | 99.37 | 82 | 28d ago |
| GPQA-D | RecursiveMAS | Accuracy (%) | 66.2 | 77 | 21d ago |
| GPQA | MCTS with Const-o-T | Accuracy | 70.2 | 75 | 2mo ago |
| TheoremQA | | Accuracy | 82.49 | 68 | 1mo ago |
| GPQA Diamond | Qwen3-30B-A3B-Thinking-2507 | Score | 73.2 | 68 | 1mo ago |
| GPQA Diamond | | Pass@1 Accuracy | 78.66 | 67 | 1d ago |
| GPQA Diamond | | Accuracy | 80.13 | 62 | 1d ago |
| GPQA | Nanbeige4.1-3B | Accuracy | 83.8 | 55 | 3mo ago |
| ScienceQA | InternVL3.5 | Score | 96.09 | 54 | 1d ago |
| GPQA diamond | CoT | Latency | 4.43 | 54 | 21d ago |
| PrincipiaBench | | RealMath Score | 48.74 | 50 | 2mo ago |
| GPQA Diamond | Conductor | Accuracy | 87.5 | 48 | 2mo ago |
| SciKnowEval | VPD | Chemistry Accuracy | 82.38 | 47 | 14d ago |
| GPQA | EvoEnv | pass@1 | 70.2 | 43 | 19d ago |
| HiPhO 2024–2025 1.0 | Gemini-3-Pro | IPhO 2025 Score | 25.2 | 43 | 3mo ago |
| GPQA Diamond | | Accuracy | 65.2 | 41 | 23d ago |
| GPQA Diamond | RL +length penalty | Total Inference Runtime (s) | 1 | 36 | 1mo ago |
| SCIENTIFIC | DEER | Accuracy | 82.8 | 36 | 2mo ago |
| CEval Hard | TopoPrior+ARG | Math Score | 79.09 | 36 | 15d ago |
| SuperGPQA | Agentic Proposing | Mean@1 | 50.1 | 34 | 1mo ago |
| SciBench | | Accuracy | 44.3 | 33 | 7d ago |
| AI2-ARC Scientific | CorDA | Recall | 79 | 32 | 8d ago |