
Mathematical Reasoning on GSM-Hard (Accuracy, AVG, Improvement Overhead)

89.52 Accuracy

GPT-5-High*

[Chart: Accuracy over time, Aug 19, 2025 to Mar 20, 2026]
Updated 26d ago

Evaluation Results

Date      Accuracy   AVG     Improvement Overhead
2025.09   89.52      -       -
2025.08   72.1       78.22   -
2025.08   69.89      74.37   6.5
2025.08   68.75      74.18   6.2
2025.08   68.08      68.52   -
2026.03   64         -       -
2025.08   63.99      69.86   -
2025.09   63.3       -       -
2025.08   60.23      63.59   9.5
2025.09   57.4       -       -
2025.08   56.48      58.09   -
2025.08   52.01      58.46   0.63
2025.09   48.1       -       -
2026.03   44.6       66.01   -
2026.03   42.5       64.99   -
2026.03   41.25      61.15   -
2026.03   39.8       -       -
2026.03   39.6       -       -
2026.03   39.6       -       -
2026.03   39.6       -       -
2026.03   39.4       57.01   -
2026.03   39.4       -       -
2026.03   39.4       -       -
2026.03   38         60.33   -
2025.08   36.69      55.48   -
2026.03   35.8       65.94   -
2026.03   34.8       -       -
2026.03   34.6       -       -
2026.03   34.2       68.23   -
2026.03   33.8       -       -
2026.03   33.2       59.56   -
2026.03   33         61.13   -
2026.03   32.8       -       -
2026.03   31.8       64.19   -
2026.03   31.8       -       -
2026.03   31.8       -       -
2026.03   31.4       -       -
2026.03   31.2       -       -
2026.03   31.2       -       -
2026.03   31         -       -
2026.03   30         -       -
2026.03   29.8       -       -
2026.03   23.4       -       -
2026.03   23         -       -
2026.03   22.8       -       -
2026.03   21.4       -       -