| Benchmark | Top model | Metric | Value | Entries | Updated |
| --- | --- | --- | --- | --- | --- |
| BBH | GHG-TDA | Accuracy | 95.4 | 507 | 3d ago |
| ARC Easy | GPT-4 | Accuracy | 96.63 | 183 | 2d ago |
| HellaSwag (HS) | | Accuracy | 86.31 | 142 | 3d ago |
| PIQA | LLaDA2.0-flash | Accuracy | 96.5 | 133 | 3d ago |
| 7-benchmark commonsense and reading-comprehension suite (ARC-Easy, ARC-Challenge, HellaSwag, WinoGrande, PIQA, BoolQ, and OpenBookQA), LM Evaluation Harness default (test) | LATMiX-LU | Accuracy | 68.77 | 108 | 3d ago |
| GPQA Diamond | | Accuracy | 91.9 | 88 | 3d ago |
| WinoGrande (WG) | InternLM2-20B | Accuracy | 85.2 | 87 | 3d ago |
| GSM8K | GPT-5.2 | Accuracy | 1 | 83 | 2d ago |
| ARC | Qwen3-8B | Accuracy | 92.34 | 83 | 3d ago |
| LiveBench Reasoning | DIP | Accuracy | 92 | 80 | 3d ago |
| ARC Challenge | KALE | Accuracy | 91.09 | 70 | 2d ago |
| OpenBookQA | BioBridge | Accuracy | 88.4 | 63 | 3d ago |
| MATH 500 | | Accuracy (%) | 100 | 59 | 3d ago |
| MMLU-Pro | | Accuracy | 90.1 | 50 | 3d ago |
| Humanity's Last Exam | HEART | Accuracy | 84.61 | 46 | 3d ago |
| ARC Challenge | | Accuracy | 96.7 | 45 | 3d ago |
| SIQA | | Accuracy | 83.2 | 44 | 3d ago |
| ARC-c | Qwen3-8B | Accuracy | 90.36 | 42 | 3d ago |
| AIME 24 | | Accuracy | 80 | 41 | 3d ago |
| CoT-Collection Scenario 1 | LaDa | Accuracy | 70 | 40 | 3d ago |
| AIME 25 | Parallel-Probe | Accuracy | 76.9 | 40 | 3d ago |
| BBH (test) | DeepSeekMath-Base | Accuracy | 59.5 | 40 | 3d ago |
| GPQA | Layer-wise Caprese | Accuracy | 55.05 | 38 | 3d ago |
| Reasoning Suite Average | GHG-TDA | Accuracy | 72.8 | 36 | 3d ago |
| GSM8K (test) | MCTS | EM Accuracy | 89.6 | 35 | 3d ago |
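For the suite-level rows (the 7-benchmark commonsense and reading-comprehension suite, and the Reasoning Suite Average), the single accuracy figure is presumably the unweighted mean of per-task accuracies. Below is a minimal sketch of how the 7-task suite score could be reproduced with EleutherAI's lm-evaluation-harness, assuming its v0.4 Python API; the checkpoint name is a placeholder, since the table does not record the exact model revision or harness version behind the 68.77 entry.

```python
# Sketch: scoring the 7-task suite with lm-evaluation-harness (pip install lm-eval).
# Assumes the v0.4 API; task names below are the harness's default identifiers.
from lm_eval import simple_evaluate

TASKS = [
    "arc_easy", "arc_challenge", "hellaswag",
    "winogrande", "piqa", "boolq", "openbookqa",
]

results = simple_evaluate(
    model="hf",  # Hugging Face backend
    model_args="pretrained=meta-llama/Llama-3.1-8B",  # placeholder checkpoint
    tasks=TASKS,
)

# Suite score as (presumably) reported in the table:
# unweighted mean of per-task accuracies.
# "acc,none" is the v0.4 metric key for plain accuracy.
accs = [results["results"][t]["acc,none"] for t in TASKS]
print(f"suite average: {100 * sum(accs) / len(accs):.2f}")
```

Note that several tasks (e.g., HellaSwag, ARC-Challenge) also report a length-normalized `acc_norm,none`; which variant a leaderboard entry averages can shift the suite score by a few points, so comparisons are only meaningful when the metric variant is held fixed.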