Mathematical Reasoning on GSM8K (Random, Reward, ∆ (pp))
[Chart: Random Baseline, Reward Score, and Performance Delta (pp) by method. Highlighted entry: Sparse Reward, Random Baseline 90.4, Oct 2, 2025.]
Evaluation Results
| Method | Backbone | Date | Random Baseline | Reward Score | Performance Delta (pp) |
|---|---|---|---|---|---|
| Sparse Reward | Qwen3-4B | 2025.10 | 90.4 | 93 | 2.5 |
| Dense Reward | Qwen3-4B | 2025.10 | 89.6 | 91.4 | 1.8 |
| Interval Reward | Qwen3-4B | 2025.10 | 87.8 | 91.2 | 3.5 |
| Sparse Reward | Qwen2.5-7B | 2025.10 | 85.8 | 88.8 | 3 |
| Sparse Reward | Llama3.1-8B | 2025.10 | 80.6 | 82.1 | 1.5 |
| Interval Reward | Qwen2.5-7B | 2025.10 | 78.8 | 82.5 | 3.7 |
| Interval Reward | Llama3.1-8B | 2025.10 | 67.6 | 78.2 | 10.7 |
| Dense Reward | Llama3.1-8B | 2025.10 | 64.6 | 71.1 | 6.5 |
| Dense Reward | Qwen2.5-7B | 2025.10 | 38.4 | 55.7 | 17.4 |
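The page does not say how Performance Delta (pp) is derived. A minimal sketch, assuming it is Reward Score minus Random Baseline in percentage points, recomputes the delta from the displayed scores and lists the rows where it differs from the reported value. The names `ROWS`, `recomputed_delta`, and `delta_gaps` are hypothetical and used only for illustration.

```python
# Rows copied from the results above:
# (method, backbone, random_baseline, reward_score, reported_delta_pp)
ROWS = [
    ("Sparse Reward", "Qwen3-4B", 90.4, 93.0, 2.5),
    ("Dense Reward", "Qwen3-4B", 89.6, 91.4, 1.8),
    ("Interval Reward", "Qwen3-4B", 87.8, 91.2, 3.5),
    ("Sparse Reward", "Qwen2.5-7B", 85.8, 88.8, 3.0),
    ("Sparse Reward", "Llama3.1-8B", 80.6, 82.1, 1.5),
    ("Interval Reward", "Qwen2.5-7B", 78.8, 82.5, 3.7),
    ("Interval Reward", "Llama3.1-8B", 67.6, 78.2, 10.7),
    ("Dense Reward", "Llama3.1-8B", 64.6, 71.1, 6.5),
    ("Dense Reward", "Qwen2.5-7B", 38.4, 55.7, 17.4),
]


def recomputed_delta(random_baseline, reward_score):
    """Assumed definition: reward score minus random baseline, in pp."""
    return round(reward_score - random_baseline, 1)


def delta_gaps(rows, tolerance=0.05):
    """Return rows whose reported delta differs from the recomputed one."""
    gaps = []
    for method, backbone, baseline, reward, reported in rows:
        computed = recomputed_delta(baseline, reward)
        if abs(computed - reported) > tolerance:
            gaps.append((method, backbone, computed, reported))
    return gaps


if __name__ == "__main__":
    for method, backbone, computed, reported in delta_gaps(ROWS):
        print(f"{method} ({backbone}): recomputed {computed}, reported {reported}")
```

Under this assumption, some reported deltas differ from the recomputed ones by 0.1 pp. One plausible explanation, not confirmed by the page, is that the deltas were computed from unrounded scores before display rounding.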