| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| General Task Performance | Macro-average (Mathematics, Multi-Hop QA, Code Generation) | Accuracy | 69 | 21 |
| Re-identification | Macro Average (Across Datasets) | AUC | 96.9 | 18 |
| Mathematical Reasoning | Macro Average AIME2024, MATH, Minerva, Olympiad-Bench | Pass@1 | 55 | 16 |
| Mathematical Reasoning | Macro Average Selected Benchmarks | Pass@1 (Avg@32) | 52.8 | 14 |
| Text Classification | Macro-Average | Mean Accuracy | 82.14 | 11 |
| Regression | Macro-average SICKR-STS, STS-B, WMT_RU_EN, WMT_EN_ZH, WMT_SI_EN (test) | Pearson Correlation (r) | 76.3 | 11 |
| Mathematical Reasoning | Macro-average | Avg@8 | 36.6 | 10 |
| Question Answering and Reasoning | Macro-average (MMLU, MATH, GSM8K, BBH) | Cost Reduction | 46 | 8 |
| Graph-based Agent Memory Poisoning | Macro Average (PubMedQA, WebShop, ToolEmu) | Utilization (Util.) | 98.4 | 5 |
| Procedural Planning | Macro Average Zero-shot | Macro Accuracy (Zero-shot) | 69.7 | 4 |
| Procedural Planning | Macro Average In-domain | Macro Accuracy | 56.3 | 4 |