| Dataset Name | SOTA Method | Metric | Trend | Updated | |
|---|---|---|---|---|---|
| BBH | GHG-TDA | Accuracy: 95.4 | 672 | 2d ago | |
| ARC Easy | GPT-4 | Accuracy: 96.63 | 187 | 19d ago | |
| HellaSwag (HS) | | HellaSwag Accuracy: 86.31 | 162 | 11d ago | |
| PIQA | LLaDA2.0-flash | Accuracy: 96.5 | 145 | 1mo ago | |
| GPQA Diamond | | Accuracy: 91.9 | 135 | 5d ago | |
| WinoGrande (WG) | InternLM2-20B | Accuracy: 85.2 | 135 | 4d ago | |
| 7-benchmark commonsense and reading-comprehension suite (ARC-Easy, ARC-Challenge, HellaSwag, WinoGrande, PIQA, BoolQ, and OpenBookQA), LM Evaluation Harness default (test) | LATMiX-LU | Accuracy: 68.77 | 108 | 1mo ago | |
| GSM8K | GPT-5.2 | Accuracy: 1 | 106 | 2d ago | |
| MMLU-Pro | Agent Q-Mix | Accuracy: 92.86 | 95 | 5d ago | |
| ARC | Qwen3-8B | Accuracy: 92.34 | 94 | 25d ago | |
| ARC Challenge | Qwen3 | Accuracy: 97.2 | 93 | 4d ago | |
| MATH 500 | | Accuracy (%): 100 | 90 | 2d ago | |
| LiveBench Reasoning | DIP | Accuracy: 92 | 80 | 1mo ago | |
| ARC-c | Qwen3-8B | Accuracy: 90.36 | 80 | 11d ago | |
| OpenBookQA | BioBridge | Accuracy: 88.4 | 77 | 26d ago | |
| GSM PRO | ZERO-SHOT | Accuracy: 100 | 72 | 10d ago | |
| BBH (test) | Dynamic Persona Routing | Accuracy: 62.06 | 67 | 4d ago | |
| AIME 24 | PETS-On. | Accuracy: 70 | 58 | 1mo ago | |
| GPQA | | Accuracy: 59.4 | 57 | 26d ago | |
| Checkmate-in-One | RoT | Accuracy: 92 | 57 | 1mo ago | |
| BIG-Bench Hard (BBH) (test) | GPT-4o | Average Accuracy: 87.3 | 56 | 8d ago | |
| BBH 3-shot | | BBH 3-shot Score: 65.69 | 49 | 8d ago | |
| AIME 24 | TF-TTCL | Accuracy on AIME 24: 83.33 | 49 | 2d ago | |
| Bamboogle | RM-Regen | Accuracy: 73 | 46 | 25d ago | |
| MuSR 0-shot | UltraMix-190k | Reasoning Score (0-shot): 48.82 | 46 | 1mo ago | |