| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Question Answering | Overall NQ, TriviaQA, BioASQ, PopQA | Accuracy | 0.617 | 32 |
| Macro-average Reasoning | Overall NaturalPlan AIME 2024 GPQA | Final Score (Macro-Avg) | 96.5 | 28 |
| Mathematical Reasoning | Overall GSM8K, MATH-500, AMC, AIME24, AIME25 | Accuracy | 91.6 | 26 |
| Polyp Segmentation | Overall Combined 5 Datasets (test) | mDice | 85.1 | 24 |
| Knowledge Graph Completion | Overall DB15K, MKG-W, MKG-Y | MRR | 41.04 | 22 |
| Model Evaluation Summary | Overall Aggregate | Average Score | 1.003 | 22 |
| Polyp Segmentation | Overall Combined Datasets | mDice | 0.844 | 21 |
| Mathematical Reasoning | Overall Macro-average | Accuracy (%) | 70.97 | 20 |
| Correctness Prediction | Overall Combined Datasets | Accuracy | 70.12 | 18 |
| Emotion Reasoning | Overall (test) | Factual Alignment (FA) | 3.54 | 17 |
| Question Answering | Overall | Accuracy | 77.1 | 15 |
| Survival Prediction | Overall Across Cohorts | C-Index | 0.629 | 15 |
| Visual Grounding | Overall | Accuracy | 84.87 | 12 |
| AI-generated image detection | Overall In-the-wild Aggregate | Average Accuracy | 91.8 | 11 |
| Summarization | Overall Multi-dataset Average | Completeness | 48 | 11 |
| Mathematical Reasoning | Overall Combined Benchmarks | Avg@3 Score | 58.4 | 10 |
| Retrieval | Overall (Average) | Recall@10 | 36.6 | 10 |
| Question Answering | Overall Average (test) | EM | 58.3 | 10 |
| Adversarial Code Compliance | Overall Mean | Decoupling Probability | 97.1 | 9 |
| Tool-Integrated Reasoning | Overall 9 Benchmarks | Average Score | 88 | 9 |
| Retrieval | Overall (Musique, HotpotQA, NarrativeQA, DetectiveQA) | Avg Recall@3 | 56.64 | 8 |
| Aggregate Performance | Overall Across All Benchmarks | SUM | 563.56 | 8 |
| Molecule Property Prediction | Overall | Top-1 Count | 21 | 8 |
| Aggregated Logical Reasoning | Overall Mean | Accuracy | 76.2 | 7 |
| Aggregated Logical Reasoning | Overall Unsolvable | Accuracy | 0.945 | 7 |