| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| ScienceQA image | InternVL2.5-8B | Accuracy | 98 | 259 | 1d ago |
| GPQA Diamond | | Accuracy | 84.4 | 123 | 1d ago |
| ScienceQA | MASQuant | Accuracy | 88.6 | 61 | 2mo ago |
| GPQA | Spurious | Accuracy | 54.44 | 40 | 5d ago |
| SciQA | Qwen2.5-VL-32B | Accuracy | 91.4 | 35 | 7d ago |
| GPQA | CoT | Average Inference Time (s) | 1.58 | 30 | 2mo ago |
| GPQA | LARFT | Score | 33.48 | 28 | 2mo ago |
| MMLU | EVOSELECT | Accuracy | 79.45 | 27 | 1mo ago |
| GPQA Diamond | Qwen3-4B-Thinking-2507 + DEER | Accuracy (ACC) | 53.03 | 22 | 21d ago |
| GPQA | MOTAB | Pass@8 | 79.29 | 21 | 14d ago |
| GPQA D | GRPO | Accuracy (mean@4) | 38.5 | 21 | 22d ago |
| GPQA Main (test) | DecepChain | P@1 | 23.04 | 20 | 12d ago |
| GPQA-D | IOA | Accuracy (GPQA-D) | 14.43 | 20 | 3mo ago |
| SciRAG-SSLI hard 1.0 (test) | | F1 Score | 46.86 | 19 | 3mo ago |
| SciRAG-SSLI easy 1.0 (test) | RankGPT | F1 Score | 46.55 | 19 | 3mo ago |
| GPQA | CISPO | pass@1 | 18.2 | 18 | 3mo ago |
| GPQA Main | DecepChain | LLM Trust Score | 98.39 | 16 | 12d ago |
| MMLU-Pro | | Accuracy | 88.6 | 16 | 5d ago |
| GPQA Diamond (test) | Transformers | Pass@1 | 49 | 16 | 3mo ago |
| GPQA | Skywork | Biology Domain Score | 65.8 | 14 | 7d ago |
| GPQA Diamond | Qwen3 | pass@1 | 65.2 | 14 | 20d ago |
| GPQA | | Avg Response Length | 9,032 | 13 | 21d ago |
| HLE-Verified Gold (test) | ATLAS-MM | Accuracy | 60 | 12 | 1d ago |
| SuperGPQA* | | Accuracy | 62.4 | 12 | 5d ago |
| GPQA | | Accuracy | 82.4 | 12 | 5d ago |