| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Zero-shot Reasoning | Reasoning Tasks (BoolQ, PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, OBQA) Zero-shot | BoolQ Accuracy (Zero-shot): 82.813 | 55 |
| Zero-shot Reasoning | Zero-Shot Reasoning Tasks (ARC-C, ARC-E, BoolQ, Hella, OBQA, PIQA, SIQA, Wino) | ARC-C Accuracy: 65.53 | 54 |
| Reasoning | Reasoning Tasks Average | Average Score: 68.6 | 32 |
| Single-turn Reasoning | Reasoning Tasks AIME24, AIME25, GPQA | AIME 2024 Accuracy: 92.2 | 18 |
| Zero-shot Evaluation | Reasoning tasks | Reasoning Accuracy: 70.7 | 7 |
| Model Ranking Prediction | Reasoning Tasks Aggregate | Spearman Rho: 0.81 | 6 |
| Reasoning Chain Optimization | Reasoning Tasks | Query Count: 47 | 3 |