Mathematical Reasoning on GSM8k (Accuracy)

[Figure: GSM8k accuracy over time. y-axis: Accuracy (82.84 to 100); x-axis: date (Mar 7, 2024 to May 18, 2026); labeled entry: Phi-4 pass@N (Upper Bound)]
Updated 14d ago

Evaluation Results

Method | Links
100
2025.11
100
2025.11
100
2025.11
100
2025.11
100
2025.11
100
100
2025.11
99.5
2025.11
98.9
2025.11
98.4
2025.11
98.4
97.9
2025.12
96.21
2025.12
95.83
2026.05
95.5
2026.05
95.4
2026.05
95.4
2026.05
95.2
2026.05
95.2
2026.05
95.2
2026.05
94.8
2026.05
94.7
2026.05
94.4
2026.05
94.2
2026.05
94.1
2026.05
93.9
2026.05
93.8
2026.04
93.3
2026.04
93.1
2026.04
93.02
2024.10
93
2026.04
93
2024.06
92.7
2024.06
92.6
2026.04
92.6
2026.04
92.4
2026.05
92.4
2026.04
92.2
2024.03
92
2026.04
91.9
2026.01
91.66
2024.10
91.5
2026.01
91.5
2026.04
91.2
2026.01
91
2025.12
90.45
2026.03
90.2
2026.04
89.9
2025.12
89.84
2026.04
89.6
2026.03
89.5
2026.04
89.5
2026.03
89.1
2026.04
89.1
2026.04
89.08
2026.01
88.8
2025.12
88.78
2026.04
88.6
2026.01
88.42
2026.03
88.4
2026.04
88.4
2026.04
88.4
2024.06
88.2
2026.04
88.2
2025.12
88.17
2026.04
88.1
2024.06
87.9
2026.03
87.9
2026.04
87.8
2026.01
87.64
2026.03
87.5
2026.04
87.5
2026.01
87.41
2026.03
87.2
2026.01
87.15
2026.03
87.1
2026.04
87.1
2026.03
87
2026.03
86.7
2026.04
86.7
2026.01
86.58
2026.03
86.4
2026.04
86.2
2026.01
85.97
2026.03
85.8
2026.04
85.8
2025.09
85.1
2026.03
85
2026.03
85
2025.09
84.8
2026.04
84.76
2025.09
84.7
2026.01
84.61
2025.09
84.5
2025.09
84.4
2026.01
84.31
2026.01
84.31
2026.04
84
2026.04
84
2026.03
83.5
Showing 100 of 388 rows