
Mathematical Reasoning on Math Benchmarks (GSM8K, MATH, AMC23, AIME24) (test)

96 Accuracy (GSM8K)

Standard_p

[Chart: Accuracy (GSM8K) over time; date tick: Feb 2, 2026]
Updated 4d ago

Evaluation Results

Method | Links
2026.02 | 96 | 93 | 99.17 | 82.22 | 92.6 | 1,507.94 | 5,573.16 | 10,956.53 | 21,009.29 | 9,761.73 | -
2026.02 | 95 | 92 | 98.33 | 76.67 | 90.5 | 1,061.49 | 4,249.95 | 9,162.59 | 17,308.79 | 7,945.71 | -0.54
2026.02 | 94.83 | 92.33 | 100 | 76.67 | 90.96 | 1,029.56 | 4,102.99 | 9,180.6 | 17,284.1 | 7,899.31 | -0.41
2026.02 | 94.33 | 91.33 | 96.67 | 76.67 | 89.75 | 812.05 | 4,150.17 | 9,344.19 | 18,074.15 | 8,095.14 | -0.76
2026.02 | 80.17 | 50.33 | 25 | 4.44 | 39.99 | 179.42 | 445.3 | 734.82 | 898.33 | 564.47 | 0.59
2026.02 | 79.67 | 49.33 | 22.5 | 4.44 | 38.99 | 176.11 | 496.3 | 699.45 | 1,036.81 | 602.17 | 0.41
2026.02 | 78.83 | 47.67 | 20 | 3.33 | 37.46 | 245.06 | 605.28 | 857.21 | 1,314.42 | 755.49 | -
2026.02 | 68.67 | 43 | 24.17 | 3.33 | 34.79 | 109.61 | 422.35 | 708.14 | 937.38 | 544.37 | -0.43