Multitask Language Understanding on MMLU (MA, MI, Error Rate)
[Chart: Mean Accuracy (MA) by date. Top entry: PROBELLM, 73 (Feb 13, 2026). Metrics shown: Mean Accuracy (MA), Mean Incorrectness (MI), Error Rate.]
Evaluation Results
Method     Target Model    Date      Mean Accuracy (MA)   Mean Incorrectness (MI)   Error Rate
PROBELLM   GPT-oss-20b     2026.02   73                   27                        36
PROBELLM   phi-4           2026.02   64                   36                        70
PROBELLM   Gemini-2....    2026.02   63                   37                        38
PROBELLM   olmo-3-7b...    2026.02   62                   38                        73
PROBELLM   Llama-3.1...    2026.02   59                   41                        86
PROBELLM   GPT-4o-mini     2026.02   54                   46                        65
PROBELLM   Deepseek-...    2026.02   51                   49                        36
PROBELLM   granite-4.0     2026.02   51                   49                        72
PROBELLM   devstral        2026.02   40                   60                        66
PROBELLM   ministral...    2026.02   37                   63                        70
PROBELLM   Claude-3....    2026.02   36                   64                        68
PROBELLM   Grok-4.1-...    2026.02   34                   66                        47
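In the results listed above, Mean Accuracy (MA) and Mean Incorrectness (MI) add up to 100 for every entry, while Error Rate varies independently of both. The following is a minimal sketch for checking that relationship and re-ranking the entries by any column. The row values are copied from the listing. The names `ROWS`, `complementary`, and `rank_by` are hypothetical helpers introduced here for illustration and are not part of the leaderboard or of PROBELLM. Truncated model names are kept exactly as displayed, and the sketch makes no assumption about whether a higher or lower Error Rate is preferable, since the page does not say.

```python
# Rows copied from the listing: (target model, MA, MI, error rate).
# Truncated model names are kept exactly as they appear on the page.
ROWS = [
    ("GPT-oss-20b", 73, 27, 36),
    ("phi-4", 64, 36, 70),
    ("Gemini-2....", 63, 37, 38),
    ("olmo-3-7b...", 62, 38, 73),
    ("Llama-3.1...", 59, 41, 86),
    ("GPT-4o-mini", 54, 46, 65),
    ("Deepseek-...", 51, 49, 36),
    ("granite-4.0", 51, 49, 72),
    ("devstral", 40, 60, 66),
    ("ministral...", 37, 63, 70),
    ("Claude-3....", 36, 64, 68),
    ("Grok-4.1-...", 34, 66, 47),
]


def complementary(rows):
    """Return True if MA + MI == 100 in every row (hypothetical helper)."""
    return all(ma + mi == 100 for _, ma, mi, _ in rows)


def rank_by(rows, column, descending=True):
    """Sort rows by column index: 1 = MA, 2 = MI, 3 = error rate.

    sorted() is stable, so tied rows keep their original relative order.
    """
    return sorted(rows, key=lambda row: row[column], reverse=descending)


if __name__ == "__main__":
    print("MA + MI == 100 for all rows:", complementary(ROWS))
    for name, ma, mi, err in rank_by(ROWS, 3, descending=False):
        print(f"{name:<14} MA={ma:<3} MI={mi:<3} ErrorRate={err}")
```

Sorting by column 1 reproduces the order of the listing, which is ranked by Mean Accuracy. Sorting by column 3 shows that the Error Rate ordering differs from the Mean Accuracy ordering.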