Outcome Reasoning on HumanEval Exe
Top result: GPT-5, 75.7 M' (F1 Mean)
[Chart: M' (F1 Mean) and Y' (F1 Mean) over time; x-axis date shown: May 17, 2025]
Updated 4d ago
Evaluation Results
Method      Configuration      Date      M' (F1 Mean)   Y' (F1 Mean)
GPT-5       Model=GPT-5        2025.05   75.7           71.5
GPT-o4      Model=GPT-o4       2025.05   73.4           66.5
Llama4-M    Model=Llama4-M     2025.05   63.6           56.9
DeepSeek    Model=DeepSeek     2025.05   59.4           52.7
Qwen3       Model=Qwen3        2025.05   58.2           51.5
Gemini2.5   Model=Gemini2.5    2025.05   55.8           49.4
Llama4-S    Model=Llama4-S     2025.05   47.7           41.2