| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| SciWorld | M2CL | Accuracy | 95.9 | 164 | 1mo ago |
| ARC Challenge | Qwen-3-32B | Accuracy | 92.5 | 94 | 5d ago |
| GPQA | MCTS with Const-o-T | Accuracy | 70.2 | 75 | 18d ago |
| GPQA Diamond | Qwen3-30B-A3B-Thinking-2507 | Score | 73.2 | 68 | 8d ago |
| GPQA Main | Llama 70B | Accuracy | 43.08 | 67 | 4d ago |
| GPQA | Nanbeige4.1-3B | Accuracy | 83.8 | 55 | 1mo ago |
| GPQA Diamond | AlphaOne | Pass@1 Accuracy | 66.8 | 54 | 12d ago |
| PrincipiaBench |  | RealMath Score | 48.74 | 50 | 29d ago |
| GPQA Diamond | Conductor | Accuracy | 87.5 | 48 | 15d ago |
| HiPhO 2024–2025 1.0 | Gemini-3-Pro | IPhO 2025 Score | 25.2 | 43 | 1mo ago |
| GPQA Diamond (test) | HEART | Accuracy | 99.37 | 40 | 1mo ago |
| GPQA Diamond | RL +length penalty | Total Inference Runtime (s) | 1 | 36 | 12d ago |
| SCIENTIFIC | DEER | Accuracy | 82.8 | 36 | 1mo ago |
| GPQA Diamond |  | Accuracy (pass@1) | 62.63 | 34 | 4d ago |
| SuperGPQA | Agentic Proposing | Mean@1 | 50.1 | 34 | 4d ago |
| GPQA Diamond |  | Pass@1 | 64.58 | 32 | 1mo ago |
| ARC | SGDPO | Score | 86.41 | 29 | 1mo ago |
| GPQA | TTVS | pass@1 | 56.1 | 28 | 8d ago |
| GPQA |  | avg@16 | 44.1 | 28 | 29d ago |
| GPQA | DTSR | Accuracy | 66.2 | 28 | 9d ago |
| GPQA Diamond |  | Accuracy | 65.2 | 27 | 11d ago |
| MMLU-STEM |  | Accuracy | 87.4 | 27 | 15d ago |
| GPQA diamond |  | Accuracy | 91.9 | 24 | 1mo ago |
| GPQA | CoNL | Pass@1 | 79.2 | 22 | 1mo ago |
| GPQA | Agentic Proposing | Mean@1 | 68.3 | 22 | 1mo ago |