CEval

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Language Understanding | CEval | Accuracy 83.56 | 43 |
| Scientific Reasoning | CEval Hard | Math Score 79.09 | 36 |
| Chinese Knowledge | CEval | Accuracy 82.16 | 28 |
| Multi-task Language Understanding | CEval | Accuracy 82.5 | 22 |
| Scientific Reasoning | CEval Sci | Score 66.19 | 20 |
| General Knowledge | CEval | Score 90.4 | 19 |
| General Knowledge Evaluation | CEVAL | Accuracy 85.52 | 18 |
| Group-level distractor generation | CEval Discrete Math | Recall 45.56 | 8 |
| Actuator Inversion | All Environments (Ceval-in) | AER 0.57 | 8 |
| Multiple-choice Question Answering | CEval | Accuracy 79.86 | 7 |
| Chinese Language Evaluation | Ceval | Accuracy 77.93 | 5 |
| Medical Knowledge Evaluation | CEVAL Med | Accuracy 91.46 | 5 |
| General Knowledge and Reasoning | CEval | Accuracy 90.91 | 4 |
| General Language Understanding | CEval | Accuracy 73 | 4 |
| General Domains | CEval | Accuracy 90.91 | 4 |
| Chinese Language Understanding | CEVAL | CEVAL Score 67.17 | 3 |
| Personalized distractor generation evaluation | CEval Discrete Math | Error Rate 12 | 2 |
| Knowledge Understanding | CEval | Accuracy 45 | 2 |