| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Language Understanding | CEval | Accuracy 83.56 | 43 |
| Scientific Reasoning | CEval Hard | Math Score 79.09 | 36 |
| Chinese Knowledge | CEval | Accuracy 82.16 | 28 |
| Multi-task Language Understanding | CEval | Accuracy 82.5 | 22 |
| Scientific Reasoning | CEval Sci | Score 66.19 | 20 |
| General Knowledge | CEval | Score 90.4 | 19 |
| General Knowledge Evaluation | CEVAL | Accuracy 85.52 | 18 |
| Group-level distractor generation | CEval Discrete Math | Recall 45.56 | 8 |
| Actuator Inversion | All Environments (Ceval-in) | AER 0.57 | 8 |
| Multiple-choice Question Answering | CEval | Accuracy 79.86 | 7 |
| Chinese Language Evaluation | Ceval | Accuracy 77.93 | 5 |
| Medical Knowledge Evaluation | CEVAL Med | Accuracy 91.46 | 5 |
| General Knowledge and Reasoning | CEval | Accuracy 90.91 | 4 |
| General Language Understanding | CEval | Accuracy 73 | 4 |
| General Domains | CEval | Accuracy 90.91 | 4 |
| Chinese Language Understanding | CEVAL | CEVAL Score 67.17 | 3 |
| Personalized distractor generation evaluation | CEval Discrete Math | Error Rate 12 | 2 |
| Knowledge Understanding | CEval | Accuracy 45 | 2 |