| Dataset Name | SOTA Method | Metric | Trend | Updated |
|---|---|---|---|---|
| SciWorld | M2CL | Accuracy 95.9 | 164 | 3d ago |
| ARC Challenge | Qwen-3-32B | Accuracy 92.5 | 56 | 3d ago |
| GPQA | Nanbeige4.1-3B | Accuracy 83.8 | 55 | 3d ago |
| GPQA | Full-CoT | Accuracy 45.5 | 50 | 3d ago |
| GPQA Diamond | Conductor | Accuracy 87.5 | 45 | 3d ago |
| HiPhO 2024–2025 1.0 | Gemini-3-Pro | IPhO 2025 Score 25.2 | 43 | 3d ago |
| GPQA Diamond (test) | HEART | Accuracy 99.37 | 32 | 3d ago |
| GPQA Diamond | | Pass@1 64.58 | 32 | 3d ago |
| ARC | SGDPO | Score 86.41 | 29 | 3d ago |
| GPQA Diamond | RLCER | Score 48.77 | 28 | 3d ago |
| GPQA Diamond | | Accuracy (pass@1) 62.63 | 24 | 3d ago |
| GPQA diamond | | Accuracy 91.9 | 24 | 3d ago |
| SuperGPQA | Agentic Proposing | Mean@1 50.1 | 24 | 3d ago |
| GPQA | CoNL | Pass@1 79.2 | 22 | 3d ago |
| GPQA | Agentic Proposing | Mean@1 68.3 | 22 | 3d ago |
| GPQA Diamond | LED | P@16 91.94 | 21 | 3d ago |
| SciEval | GPT-4 | Score 73.93 | 20 | 3d ago |
| CEval Sci | SciGLM | Score 66.19 | 20 | 3d ago |
| CEval Hard | SciGLM | Overall Score 56.58 | 19 | 3d ago |
| Science Domain In-Domain: SampleQA, GPQA(ALL), HLE | FlowRL | SampleQA Score 3.26 | 18 | 3d ago |
| GPQA Diamond | PPPO | Accuracy (avg@32) 58.13 | 18 | 3d ago |
| GPQA 1.0 (test) | A3PO | Accuracy 53.8 | 18 | 3d ago |
| ARC Easy | Sink Attention | Accuracy 73.4 | 18 | 3d ago |
| MMLU-Sci | Galactica-30B | Score 54.96 | 18 | 3d ago |
| Aggregate GPQA, HLE, MMLU-Pro | Dr.SCI-4B-think | Average Score 44.6 | 17 | 3d ago |