Outcome Reasoning on Arithmetic
[Chart: M' F1 Mean and Y' F1 Mean. Best result: GPT-5, 87.8 M' F1 Mean (May 17, 2025).]
Evaluation Results

Method      Method details    Links     M' F1 Mean   Y' F1 Mean
GPT-5       Model=GPT-5       2025.05   87.8         82.7
GPT-o4      Model=GPT-o4      2025.05   85.8         80.9
Llama4-M    Model=Llama4-M    2025.05   76.3         70.6
DeepSeek    Model=DeepSeek    2025.05   74.0         67.5
Gemini2.5   Model=Gemini2.5   2025.05   72.1         65.4
Qwen3       Model=Qwen3       2025.05   69.8         63.2
Llama4-S    Model=Llama4-S    2025.05   59.1         52.4