
MMLU-Pro

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Knowledge | MMLU-Pro 5-shot | Knowledge Score (5-shot): 44.65 | 37 |
| Health | MMLU-Pro Health (FR) X (test) | Accuracy: 66.08 | 35 |
| Hallucination evaluation | MMLU-Pro Law (test) | HALL%: 12.1 | 21 |
| Mathematical Reasoning | MMLU-Pro Math | Accuracy: 77.6 | 18 |
| Academic Reasoning | MMLU-Pro | Pass@1: 50.7 | 15 |
| Medical Question Answering | MMLU-Pro Health | Accuracy: 60.76 | 12 |
| General Knowledge Reasoning | MMLU-Pro (test) | Accuracy: 37.72 | 10 |
| Multiple Choice Question Answering | MMLU-Pro Psychology | Calibration Threshold (q-hat): 0.983 | 8 |
| Multiple Choice Question Answering | MMLU-Pro Health | Calibration Threshold (q-hat): 99.2 | 8 |
| Multiple Choice Question Answering | MMLU-Pro Chemistry | Calibration Threshold (q-hat): 0.99 | 8 |
| Multiple Choice Question Answering | MMLU-Pro Law | Calibration Threshold (q-hat): 0.997 | 8 |
| Multiple-choice Question Answering | MMLU-Pro Zipf 1.4 | Accuracy: 87.4 | 7 |
| Multiple-choice Question Answering | MMLU-Pro Zipf 1.1 | Accuracy: 81.8 | 7 |
| LLM Routing | MMLU Pro Social Sciences (Out-of-Domain) | LPM: 59.2 | 7 |
| LLM Routing | MMLU Pro Humanities Out-of-Domain | LPM: 51.74 | 7 |
| General Knowledge Task | MMLU-Pro (test) | Accuracy: 56.3 | 6 |
| Science Question Answering | MMLU-Pro OOD | Accuracy (MMLU-Pro OOD Science): 54.7 | 5 |
| Query Routing | MMLU-Pro OOD | CPT Score (85%): 74.4 | 4 |
| Query Routing | MMLU-Pro OOD | CPT (80%): 66.14 | 4 |
| General Question Answering | MMLU-Pro (test) | Mean Accuracy: 79.55 | 4 |
| General | MMLU-Pro (test) | Accuracy: 83.76 | 4 |
| Multiple-choice Question Answering | MMLU-Pro (test) | Accuracy: 89.6 | 3 |
| Multiple Choice Question Answering | MMLU-Pro law 1.0 (test) | Accuracy: 71.05 | 3 |
| General Capability | MMLU-Pro OpenR1-Math Harder | Accuracy: 71.3 | 3 |
| General question answering | MMLU-Pro (test) | Optimization Token Usage: 595 | 3 |
Showing 25 of 31 rows