Reasoning Episode Classification on Omni-MATH human-annotated Reasoning episodes (gold set)
Chart (Accuracy, Kappa): GPT-5, 86.33 Accuracy, Dec 23, 2025
Evaluation Results

Method            Details                 Date     Accuracy  Kappa
GPT-5             Model=GPT-5             2025.12  86.33     82.85
GPT-4.1           Model=GPT-4.1           2025.12  86.1      82.74
GPT-5             Model=GPT-5             2025.12  86.02     82.54
GPT-4.1           Model=GPT-4.1           2025.12  85.75     82.39
Gemini-2.5-Flash  Model=Gemini-2.5-Flash  2025.12  82.9      78.67
Gemini-2.5-Flash  Model=Gemini-2.5-Flash  2025.12  82.45     78.21
Gemini-2.5-Pro    Model=Gemini-2.5-Pro    2025.12  80.89     75.96
Gemini-2.5-Pro    Model=Gemini-2.5-Pro    2025.12  80.53     75.6