Multi-task Performance Evaluation on GPQA-Diamond, GSM8K, MATH-500, AIME’24, and IFEval Aggregate

58.72 Avg Score

ProFit

[Chart: Avg Score, Jan 14, 2026]
Updated 4d ago

Evaluation Results

Date       Avg Score    Δ
2026.01    58.72        5.64
2026.01    56.38        3.3
2026.01    53.08        -
2026.01    52.33        11.41
2026.01    51.8         -1.28
2026.01    51.2         -1.88
2026.01    50.3         9.38
2026.01    41.39        0.47
2026.01    40.92        -
2026.01    39.57        -1.35
2026.01    31.49        16.82
2026.01    30.2         15.53
2026.01    28.86        14.19
2026.01    28.54        5.47
2026.01    28.51        13.84
2026.01    28.03        4.96
2026.01    27.82        4.75
2026.01    27.1         4.03
2026.01    27.04        9.97
2026.01    26.98        9.91
2026.01    26.7         9.63
2026.01    23.07        -
2026.01    22.38        5.31
2026.01    17.07        -
2026.01    14.67        -