| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| General Reasoning | Overall | Accuracy | 84.8 | 40 |
| Mathematical Reasoning | Overall | Accuracy | 80.17 | 36 |
| Question Answering | Overall NQ, TriviaQA, BioASQ, PopQA | Accuracy | 0.617 | 32 |
| Macro-average Reasoning | Overall NaturalPlan AIME 2024 GPQA | Final Score (Macro-Avg) | 96.5 | 28 |
| Classification | Overall 13 datasets aggregate | N-Mean | 85.7 | 26 |
| Mathematical Reasoning | Overall GSM8K, MATH-500, AMC, AIME24, AIME25 | Accuracy | 91.6 | 26 |
| General Reasoning | Overall MATH-500 AIME25 HumanEval GPQA | Accuracy | 85.1 | 24 |
| Polyp Segmentation | Overall Combined 5 Datasets (test) | mDice | 85.1 | 24 |
| Knowledge Graph Completion | Overall DB15K, MKG-W, MKG-Y | MRR | 41.04 | 22 |
| Model Evaluation Summary | Overall Aggregate | Average Score | 1.003 | 22 |
| Reasoning | Overall AMC23, AIME24, MATH500, GPQA-D aggregate | Accuracy | 79.1 | 21 |
| Polyp Segmentation | Overall Combined Datasets | mDice | 0.844 | 21 |
| Mathematical Reasoning | Overall Macro-average | Accuracy (%) | 70.97 | 20 |
| General Performance | Overall | Overall Score | 62.05 | 19 |
| Visual Grounding | Overall | Accuracy | 84.87 | 19 |
| Correctness Prediction | Overall Combined Datasets | Accuracy | 70.12 | 18 |
| Emotion Reasoning | Overall (test) | Factual Alignment (FA) | 3.54 | 17 |
| Question Answering | Overall | Accuracy | 77.1 | 15 |
| Survival Prediction | Overall Across Cohorts | C-Index | 0.629 | 15 |
| Reward Modeling | Overall 5-Benchmark Suite | Average Score | 73.5 | 12 |
| Question Answering | Overall | EM | 41.6 | 11 |
| AI-generated image detection | Overall In-the-wild Aggregate | Average Accuracy | 91.8 | 11 |
| Summarization | Overall Multi-dataset Average | Completeness | 48 | 11 |
| Satellite-to-Ground Retrieval | Overall | Recall@1 | 53.5 | 10 |
| Ground-to-Satellite Retrieval | Overall | Recall@1 | 44.6 | 10 |