| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Class-level Continual Learning | MMLU-PRO | Average Accuracy (AA) | 50.72 | 56 |
| Multiple Choice Question Answering | MMLU-PRO zero-shot | Accuracy | 84.29 | 51 |
| Capability Self-Assessment | MMLU-Pro Science | M-F1 | 69.1 | 40 |
| Multiple-choice Question Answering | MMLU-Pro Chem. | Accuracy | 72.8 | 40 |
| Knowledge | MMLU-Pro 5-shot | Knowledge Score (5-shot) | 44.65 | 37 |
| Health | MMLU-Pro Health (FR) X (test) | Accuracy | 66.08 | 35 |
| Mathematical Reasoning | MMLU-Pro Math | Accuracy | 89.79 | 26 |
| Multi-task Language Understanding | MMLU-Pro | Best Accuracy | 71.4 | 25 |
| Hallucination evaluation | MMLU-Pro Law (test) | HALL% | 12.1 | 21 |
| Science | MMLU-Pro (test) | Accuracy | 41.9 | 18 |
| Academic Reasoning | MMLU-Pro | Pass@1 | 50.7 | 15 |
| Medical Question Answering | MMLU-Pro Health | Accuracy | 60.76 | 12 |
| Hardened Language Understanding | MMLU-Pro (test) | Accuracy (MMLU-Pro Test) | 23.4 | 11 |
| Medical Reasoning | MMLU-Pro Biology English | Accuracy | 77.7 | 11 |
| Language Understanding | MMLU-Pro (test) | MMLU-Pro (test) Accuracy | 23.6 | 11 |
| Multi-task Language Understanding | MMLU-Pro AceReason (Reduced) | Accuracy | 71.1 | 10 |
| Multi-task Language Understanding | MMLU-Pro AceReason (Complete) | Accuracy (MMLU-Pro AceReason) | 76.5 | 10 |
| Language Understanding | MMLU-Pro 80 (test) | Pass@1 | 38.68 | 10 |
| General Knowledge Reasoning | MMLU-Pro (test) | Accuracy | 37.72 | 10 |
| General Knowledge Reasoning | MMLU-Pro | BCA | 91.6 | 9 |
| Math-reasoning | MMLU-Pro | pass@1 | 40.3 | 8 |
| Ranking Consistency Analysis | MMLU-Pro Nutrition health | Spearman Correlation | 0.78 | 8 |
| Ranking Consistency Analysis | MMLU-Pro Medical genetics health | Spearman Correlation | 0.42 | 8 |
| Ranking Consistency Analysis | MMLU-Pro health Human aging | Spearman Correlation | 0.62 | 8 |
| Ranking Consistency Analysis | MMLU-Pro health Virology | Spearman Correlation | 0.55 | 8 |