Reasoning Episode Classification on Omni-MATH Non-Reasoning episodes (human-annotated gold set)
[Chart: Accuracy and Cohen's Kappa over time. Top result: GPT-4.1, 89.34 Accuracy, Dec 23, 2025.]
Evaluation Results
Method            Model             Date     Accuracy  Cohen's Kappa
GPT-4.1           GPT-4.1           2025.12  89.34     85.36
GPT-5             GPT-5             2025.12  89.34     85.35
Gemini-2.5-Flash  Gemini-2.5-Flash  2025.12  87.16     82.35
Gemini-2.5-Pro    Gemini-2.5-Pro    2025.12  84.43     78.62
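
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how Accuracy and Cohen's Kappa are conventionally computed from a list of gold labels and a list of predicted labels. It is not the benchmark's own evaluation code. The function names and the label strings ("reasoning", "non_reasoning") are hypothetical, chosen only for illustration. The scores on this page appear to be scaled by 100, whereas the sketch returns raw fractions.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of items where the predicted label equals the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def cohens_kappa(gold, pred):
    """Cohen's kappa: agreement corrected for chance, (p_o - p_e) / (1 - p_e)."""
    n = len(gold)
    p_o = accuracy(gold, pred)  # observed agreement
    gold_counts = Counter(gold)
    pred_counts = Counter(pred)
    # Expected agreement if the two label sequences were independent.
    p_e = sum(gold_counts[c] * pred_counts[c] for c in gold_counts) / (n * n)
    if p_e == 1.0:
        # Degenerate case: both sequences use a single identical label.
        # Returning 1.0 is a convention chosen for this sketch only.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical labels, for illustration only.
gold = ["reasoning", "reasoning", "non_reasoning", "non_reasoning"]
pred = ["reasoning", "reasoning", "non_reasoning", "reasoning"]
acc = accuracy(gold, pred)
kappa = cohens_kappa(gold, pred)
print(f"accuracy={acc:.4f} kappa={kappa:.4f}")
```

Kappa is lower than accuracy whenever some agreement could arise by chance, which is consistent with the Cohen's Kappa column sitting below the Accuracy column for every method listed.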