Outcome Reasoning on CRASS
Chart: M' (F1 Mean) and Y' (F1 Mean) by date. Top result: GPT-5, 92.1 M' (F1 Mean), May 17, 2025.
Evaluation Results
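| Method    | Configuration   | Date    | M' (F1 Mean) | Y' (F1 Mean) |
|-----------|-----------------|---------|--------------|--------------|
| GPT-5     | Model=GPT-5     | 2025.05 | 92.1         | 88           |
| GPT-o4    | Model=GPT-o4    | 2025.05 | 90.5         | 86.2         |
| Llama4-M  | Model=Llama4-M  | 2025.05 | 84.9         | 79.5         |
| DeepSeek  | Model=DeepSeek  | 2025.05 | 82.9         | 77.1         |
| Gemini2.5 | Model=Gemini2.5 | 2025.05 | 81.7         | 75.2         |
| Qwen3     | Model=Qwen3     | 2025.05 | 80.5         | 73.9         |
| Llama4-S  | Model=Llama4-S  | 2025.05 | 70.1         | 63.5         |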