| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| General Reasoning Evaluation | Reasoning Benchmarks Aggregate | Average Score: 70.63 | 24 |
| Reasoning | 17 Reasoning Benchmarks Aggregate (test) | Accuracy: 90.71 | 21 |
| Mathematical Reasoning | Reasoning Benchmarks Overall | Delta Accuracy: 5.81 | 16 |
| Reasoning | Reasoning Benchmarks Average of MATH-500, AIME 24, AIME 25, GPQA Diamond, CommonsenseQA, LiveCodeBench, and LongBenchv2 Qwen3 | Accuracy: 74.8 | 12 |
| Reasoning | Reasoning Benchmarks (GSM8K, Math, AIME, HumanEval, LiveCodeBench) | GSM8K Accuracy: 85.12 | 9 |
| Reasoning | Reasoning Benchmarks (GSM8K, Math, AIME, HumanEval, LiveCodeBench) (test) | GSM8K Accuracy: 87.23 | 9 |
| Multi-agent Reasoning | Reasoning Benchmarks Cooperative AutoGen framework (test) | Overall Accuracy: 83.58 | 2 |
| Multi-agent Reasoning | Reasoning Benchmarks Competitive MAD framework (test) | Average Score: 0.8509 | 2 |