Mathematical Reasoning on GSM8K (AR%, Risk%)
[Chart: AR (%) and Risk (%) by method. Highlighted point: Always-Act, AR 100%, May 18, 2026.]
Evaluation Results
Method            Setting       Date      AR (%)   Risk (%)
Always-Act        alpha=0.05    2026.05   100      5
ACI               alpha=0.05    2026.05   99.7     4.8
SAOCP             alpha=0.05    2026.05   99.5     4.7
Fixed-Threshold   alpha=0.05    2026.05   99.5     4.7
Naive-Tuning      alpha=0.05    2026.05   99.5     4.7
NEX-Conf          alpha=0.05    2026.05   97.6     4.4
CSA               alpha=0.05    2026.05   71.7     2.6
CRC               alpha=0.05    2026.05   28.9     4.9
LTT               alpha=0.05    2026.05   9        0.3
ConfFact          alpha=0.05    2026.05   9        0.3