| Dataset Name | SOTA Method | Metric | Trend | Last Updated | |
|---|---|---|---|---|---|
| Reasoning and Knowledge Suite (MMLU, ARC-C, ARC-E, BoolQ, CSQA, HSwag, PIQA, SocIQ, Wino) (various) | Qwen3-4B | MMLU 75.78 | 14 | 1mo ago | |
| Out-of-Distribution Benchmarks Summary | ICPO† | Average Score 75.3 | 12 | 3d ago | |
| GSM8K, Math, AIME, HumanEval, LiveCodeBench, ARC-C, ARC-E, MMLU, GPQA | Reasoning | GSM8K 95.41 | 9 | 1mo ago | |