| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Reasoning | Reasoning Benchmarks BBH, MMLU, ARC-C, ThmQA (test) | BBH 64.66 | 66 | |
| Reasoning | Reasoning Benchmarks ARC-e, ARC-c, BoolQ, PIQA, SIQA, HellaS., OBQA, Wino. | ARC-e Accuracy 72.6 | 38 | |
| Reasoning | 15 reasoning benchmarks weighted mean (test) | Accuracy 81.35 | 36 | |
| Reasoning | Reasoning Benchmarks GSM8K, MATH-500, AIME24, AIME25, GPQA-D | GSM8K Accuracy 95.15 | 33 | |
| Common Sense Reasoning and Question Answering | Reasoning Benchmarks Zero-shot (PIQA, ARC, HellaSwag, WinoGrande) | PIQA Accuracy 82.75 | 31 | |
| Reasoning | Reasoning Benchmarks Zero-shot | Overall Zero-Shot Accuracy 69.99 | 26 | |
| General Reasoning Evaluation | Reasoning Benchmarks Aggregate | Average Score 70.63 | 24 | |
| Reasoning | Reasoning Benchmarks GPQA-Diamond AIME2024 MATH500 HumanEval | Average Score 85.77 | 21 | |
| Reasoning | 17 Reasoning Benchmarks Aggregate (test) | Accuracy 90.71 | 21 | |
| Zero-shot evaluation | Reasoning Benchmarks Zero-shot (BoolQ, PIQA, HellaSwag, WinoGrande, ARC) | BoolQ Accuracy (Zero-shot) 71.1 | 20 | |
| Mathematical Reasoning | Reasoning Benchmarks Overall | Delta Accuracy 5.81 | 16 | |
| Reasoning | Reasoning Benchmarks MATH, GSM8K, AQUA, GSM-H, MMLU, MMLU-P, GPQA, AIME | MATH Accuracy 84.4 | 14 | |
| Mathematical Reasoning | Reasoning Benchmarks Average | Average Accuracy 44.7 | 12 | |
| Reasoning | Reasoning Benchmarks Average of MATH-500, AIME 24, AIME 25, GPQA Diamond, CommonsenseQA, LiveCodeBench, and LongBenchv2 Qwen3 | Accuracy 74.8 | 12 | |
| Reasoning | Reasoning Benchmarks (GSM8K, Math, AIME, HumanEval, LiveCodeBench) | GSM8K Accuracy 85.12 | 9 | |
| Reasoning | Reasoning Benchmarks (GSM8K, Math, AIME, HumanEval, LiveCodeBench) (test) | GSM8K Accuracy 87.23 | 9 | |
| Zero-shot Common-sense Reasoning | Reasoning Benchmarks Zero-shot (ARC-e, ARC-c, BoolQ, PIQA, SIQA, HellaSwag, OBQA, WinoGrande) | ARC-e Accuracy 74.2 | 8 | |
| Zero-shot Reasoning | Reasoning Benchmarks Zero-shot (ARC-C, ARC-E, HellaSwag, LAMBADA, OpenBookQA, PIQA, WinoGrande) | ARC-C Accuracy 36.9 | 6 | |
| Mathematical Reasoning | 8 reasoning benchmarks (including GSM8K, MATH500, AIME 24, AIME 25, and OlympiadBench) (test) | Token Savings 53.3 | 5 | |
| Mathematical Reasoning | Reasoning Benchmarks (AIME24, AIME25, AMC, MATH, Minerva, Olympiad) (test) | AIME24 Accuracy 28.33 | 4 | |
| Multi-agent Reasoning | Reasoning Benchmarks Cooperative AutoGen framework (test) | Overall Accuracy 83.58 | 2 | |
| Multi-agent Reasoning | Reasoning Benchmarks Competitive MAD framework (test) | Average Score 0.8509 | 2 | |