| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Comprehensive Examination | AGIEval (test) | Accuracy 62.3 | 34 |
| General Reasoning | AGIEval | Exact Match 70.4 | 33 |
| General Evaluation | AGIEval | Accuracy 70.22 | 29 |
| Mathematical Reasoning | AGIEval MATH | Accuracy 95.7 | 28 |
| Natural Language Understanding | AGIEval | Accuracy 71.6 | 24 |
| Out-of-Domain Generalization | AGIEval Out-of-Domain Law (test) | Average OOD Accuracy 43.41 | 16 |
| General Reasoning | AGIEval en | Speedup Ratio 2.132 | 15 |
| Human-level Standardized Exam Evaluation | AGIEval | Score 45.87 | 14 |
| General Reasoning | agieval | Accuracy 63.71 | 14 |
| Question Answering | AGIEval | Vanilla Accuracy 43.92 | 14 |
| Question Answering | AGIEval | Accuracy 32.11 | 12 |
| Mathematical Reasoning | AGIEval-MATH (test) | Accuracy 52.1 | 11 |
| Reasoning | AGIEval | AGIEval Reasoning Accuracy 48.88 | 10 |
| General Intelligence Evaluation | AGIEval (test) | AGIEval (3-shot) 27 | 8 |
| Question Answering | AGIEval (test) | AQUA-RAT 28.3 | 5 |
| General Intelligence Evaluation | AGIEval G | Accuracy 72 | 4 |
| General Knowledge | AGIEval En | CoT EM 77.92 | 3 |
| General Language Understanding | AGIEval 5-shot | Accuracy 80.22 | 3 |