Mathematical Computation on GSM8K (EM, F1)
Best result: 97.66 EM (Prompt-R1)
[Chart: EM and F1 scores by date; the only date shown is Nov 2, 2025]
Updated 1mo ago
Evaluation Results
Method         Details                    Links    EM     F1
Prompt-R1      Backbone=GPT-4o-mini       2025.11  97.66  97.66
GRPO           Backbone=Qwen3-4B          2025.11  92.97  92.97
GEPA           Optimization Framework...  2025.11  87.5   90.1
Baseline       Backbone=Qwen3-4B          2025.11  84.38  84.38
CoT Reasoning  Backbone=GPT-4o-mini       2025.11  84.38  88.02
Baseline       Backbone=GPT-4o-mini       2025.11  83.59  86.72
CoT Reasoning  Backbone=Qwen3-4B          2025.11  82.81  82.81
TextGrad       Optimization Framework...  2025.11  70.31  85.99
OPRO           Optimization Framework...  2025.11  63.28  83.65
SFT            Backbone=Qwen3-4B          2025.11  32.03  32.03
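
This page does not say how the EM and F1 scores are computed. As a rough illustration of why the two can differ for the same method, the sketch below assumes the common question-answering definitions: exact match after light normalization, and token-overlap F1. The function names (normalize, exact_match, token_f1) and the normalization rules are hypothetical and may not match what this benchmark actually uses.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop thousands separators, and split into word/number tokens.

    Assumption: this normalization is illustrative only; the benchmark's
    actual answer-cleaning rules are not stated on the page.
    """
    text = text.lower().replace(",", "")
    return re.findall(r"[a-z0-9]+(?:\.[0-9]+)?", text)


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall (multiset overlap)."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # A verbose but correct answer fails exact match yet earns partial F1.
    print(exact_match("The answer is 72", "72"))
    print(token_f1("The answer is 72", "72"))
```

Under these assumed definitions, a model that always outputs only the bare final answer gets identical EM and F1, while a model that wraps correct answers in extra words scores lower on EM than on F1. That pattern would be consistent with rows where the two columns diverge, though the page itself does not confirm the cause.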