| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| General Task Performance | Macro-average (Mathematics, Multi-Hop QA, Code Generation) | Accuracy | 69 | 21 |
| Re-identification | Macro Average (Across Datasets) | AUC | 96.9 | 18 |
| Mathematical Reasoning | Macro Average AIME2024, MATH, Minerva, Olympiad-Bench | Pass@1 | 55 | 16 |
| Mathematical Reasoning | Macro Average Selected Benchmarks | Pass@1 (Avg@32) | 52.8 | 14 |
| Question Answering and Reasoning | Macro-average (MMLU, MATH, GSM8K, BBH) | Cost Reduction | 46 | 8 |