| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Reasoning | Reasoning Benchmarks: GSM8K, MATH-500, AIME24, AIME25, GPQA-D | GSM8K Accuracy: 95.15 | 33 |
| Common Sense Reasoning and Question Answering | Reasoning Benchmarks, Zero-shot (PIQA, ARC, HellaSwag, WinoGrande) | PIQA Accuracy: 82.75 | 31 |
| General Reasoning Evaluation | Reasoning Benchmarks, Aggregate | Average Score: 70.63 | 24 |
| Reasoning | 17 Reasoning Benchmarks, Aggregate (test) | Accuracy: 90.71 | 21 |
| Zero-shot Evaluation | Reasoning Benchmarks, Zero-shot (BoolQ, PIQA, HellaSwag, WinoGrande, ARC) | BoolQ Accuracy (Zero-shot): 71.1 | 20 |
| Reasoning | Reasoning Benchmarks, Zero-shot | PIQA Accuracy: 80.79 | 16 |
| Mathematical Reasoning | Reasoning Benchmarks, Overall | Delta Accuracy: 5.81 | 16 |
| Reasoning | Reasoning Benchmarks: MATH, GSM8K, AQUA, GSM-H, MMLU, MMLU-P, GPQA, AIME | MATH Accuracy: 84.4 | 14 |
| Reasoning | Reasoning Benchmarks: Average of MATH-500, AIME 24, AIME 25, GPQA Diamond, CommonsenseQA, LiveCodeBench, and LongBench v2 (Qwen3) | Accuracy: 74.8 | 12 |
| Reasoning | Reasoning Benchmarks (GSM8K, MATH, AIME, HumanEval, LiveCodeBench) | GSM8K Accuracy: 85.12 | 9 |
| Reasoning | Reasoning Benchmarks (GSM8K, MATH, AIME, HumanEval, LiveCodeBench) (test) | GSM8K Accuracy: 87.23 | 9 |
| Mathematical Reasoning | Reasoning Benchmarks, Average | Average Accuracy: 44.7 | 2 |
| Multi-agent Reasoning | Reasoning Benchmarks, Cooperative AutoGen framework (test) | Overall Accuracy: 83.58 | 2 |
| Multi-agent Reasoning | Reasoning Benchmarks, Competitive MAD framework (test) | Average Score: 0.8509 | 2 |