Mathematical Reasoning on RealMath (200 held-out questions)
Top result: MAGE, 85.1 Accuracy (May 11, 2026)
[Chart: Accuracy and Performance Gap]
Updated 22d ago
Evaluation Results
| Method | Details | Date | Accuracy | Performance Gap |
|---|---|---|---|---|
| MAGE | Backbone=Qwen3-8B, Jud... | 2026.05 | 85.1 | 31.6 |
| MAGE | Backbone=Qwen3-8B, Jud... | 2026.05 | 82.8 | 31.6 |
| 0-shot CoT | Backbone=Qwen3-8B, Jud... | 2026.05 | 53.5 | - |
| ReAct | Backbone=Qwen3-8B, Jud... | 2026.05 | 43.5 | - |
| Reflexion | Backbone=Qwen3-8B, Jud... | 2026.05 | 34.5 | - |
| SC10 | Backbone=Qwen3-8B, Jud... | 2026.05 | 33.5 | - |
| 8-shot | Backbone=Qwen3-8B, Jud... | 2026.05 | 32.5 | - |