Mathematical Reasoning on GSM8K 200 held-out questions
[Chart: Accuracy over time. Top result: MAGE, 90.4 (May 11, 2026). Selectable metrics: Accuracy, Performance Gap to Strongest Baseline.]
Evaluation Results
Method       Setting                     Date      Accuracy   Performance Gap to Strongest Baseline
MAGE         Backbone=Qwen3-8B, Jud...   2026.05   90.4       7.9
MAGE         Backbone=Qwen3-8B, Jud...   2026.05   88.2       7.9
8-shot       Backbone=Qwen3-8B, Jud...   2026.05   82.5       -
ReAct        Backbone=Qwen3-8B, Jud...   2026.05   80.1       -
SC10         Backbone=Qwen3-8B, Jud...   2026.05   72         -
0-shot CoT   Backbone=Qwen3-8B, Jud...   2026.05   64.5       -
Reflexion    Backbone=Qwen3-8B, Jud...   2026.05   63         -
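
The page does not define "Performance Gap to Strongest Baseline". A minimal sketch, assuming it is the MAGE accuracy minus the highest accuracy among the non-MAGE methods, reproduces the 7.9 listed for the 90.4 row. The dictionary and variable names are illustrative, not part of the page. Under this reading the 88.2 row would come out at 5.7 rather than the listed 7.9, so the assumption may not match how the page computes it.

```python
# Accuracy values copied from the leaderboard (top MAGE row only).
accuracy = {
    "MAGE": 90.4,
    "8-shot": 82.5,
    "ReAct": 80.1,
    "SC10": 72.0,
    "0-shot CoT": 64.5,
    "Reflexion": 63.0,
}

# Assumption: every method other than MAGE counts as a baseline.
baselines = {name: acc for name, acc in accuracy.items() if name != "MAGE"}
strongest_name = max(baselines, key=baselines.get)

# Round to one decimal to match the precision shown on the page.
gap = round(accuracy["MAGE"] - baselines[strongest_name], 1)

print(strongest_name, gap)
```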