Mathematical Reasoning on Math held-out task instances (test)
[Chart: Accuracy by date; best result: Full ExIt, 20.4 (Sep 4, 2025). Metrics shown: Accuracy, Net Improvement (Delta_16)]
Updated 3mo ago
Evaluation Results
Method                    Backbone            Date     Accuracy  Net Improvement (Delta_16)
Full ExIt                 Llama-3.2-3B-...    2025.09  20.4      2
Diverge (ExIt ablation)   Llama-3.2-3B-...    2025.09  20.1      1.6
Improve (ExIt ablation)   Llama-3.2-3B-...    2025.09  19.6      1.2
GRPO + curriculum         Llama-3.2-3B-...    2025.09  18.8      0.9
GRPO                      Llama-3.2-3B-...    2025.09  18.7      1.1
Base model                Llama-3.2-3B-...    2025.09  17.4      -0.4