| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Mathematical Reasoning | Overall | Accuracy | 89.6 | 81 |
| General Reasoning | Overall | Accuracy | 84.8 | 40 |
| Question Answering | Overall NQ, TriviaQA, BioASQ, PopQA | Accuracy | 0.617 | 32 |
| Reasoning | Overall Combined Benchmarks | Accuracy | 88.7 | 31 |
| Macro-average Reasoning | Overall NaturalPlan AIME 2024 GPQA | Final Score (Macro-Avg) | 96.5 | 28 |
| Classification | Overall 13 datasets aggregate | N-Mean | 85.7 | 26 |
| Mathematical Reasoning | Overall GSM8K, MATH-500, AMC, AIME24, AIME25 | Accuracy | 91.6 | 26 |
| Math Reasoning | Overall Across five math reasoning datasets | Overall Accuracy | 45.8 | 24 |
| General Reasoning | Overall | Accuracy | 93.51 | 24 |
| General Reasoning | Overall MATH-500 AIME25 HumanEval GPQA | Accuracy | 85.1 | 24 |
| Polyp Segmentation | Overall Combined 5 Datasets (test) | mDice | 85.1 | 24 |
| Knowledge Graph Completion | Overall DB15K, MKG-W, MKG-Y | MRR | 41.04 | 22 |
| Model Evaluation Summary | Overall Aggregate | Average Score | 1.003 | 22 |
| General performance assessment | Overall Combined Benchmarks | Performance (Seen Data) | 49.64 | 21 |
| Reasoning | Overall AMC23, AIME24, MATH500, GPQA-D aggregate | Accuracy | 79.1 | 21 |
| Polyp Segmentation | Overall Combined Datasets | mDice | 0.844 | 21 |
| Commonsense and Logical Reasoning | Overall CSQA, StrategyQA, LogiQA | Accuracy | 64.95 | 20 |
| Mathematical Reasoning | Overall Macro-average | Accuracy (%) | 70.97 | 20 |
| General Performance | Overall | Overall Score | 62.05 | 19 |
| Visual Grounding | Overall | Accuracy | 84.87 | 19 |
| Multimodal Continual Learning | Overall 20 Chunks | MAP | 61.22 | 18 |
| Multimodal Continual Learning | Overall 15 Chunks | MAP | 59.99 | 18 |
| Mathematical Reasoning | Overall Aggregated | Pass@1 | 52.3 | 18 |
| Correctness Prediction | Overall Combined Datasets | Accuracy | 70.12 | 18 |
| Combinatorial Optimization | Overall (test) | Average Performance | 73.01 | 17 |