Mathematical Reasoning on GSM8K (PR, Faithfulness, AUROC, Step Analysis)
[Chart: Performance Ratio (PR) for GRPO Baseline, May 12, 2026]
Updated 21d ago
Evaluation Results
| Method | Method Details | Date | Performance Ratio (PR) | PR Change (%) | Faithfulness Fraction | Task Accuracy | Probe AUROC | Reasoning Length (Characters) |
|---|---|---|---|---|---|---|---|---|
| GRPO Baseline | Condition=Baseline, De... | 2026.05 | 0.6 | - | 99.4 | 77.8 | 0.924 | 2,351 |
| ProFIL | Condition=ProFIL, Deco... | 2026.05 | 0 | 100 | 100 | 83 | - | 1,895 |
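The page does not define how PR Change (%) is computed. One reading consistent with the reported values (PR of 0.6 for GRPO Baseline, 0 for ProFIL, PR Change of 100) is a relative reduction against the baseline. The sketch below works through that arithmetic; the function name `relative_reduction_pct` and the formula itself are assumptions for illustration, not something taken from the benchmark.

```python
def relative_reduction_pct(baseline: float, value: float) -> float:
    """Percentage reduction of `value` relative to `baseline`.

    Assumed definition (not stated on the page):
        100 * (baseline - value) / baseline
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (baseline - value) / baseline


# PR values as reported in the results table.
baseline_pr = 0.6  # GRPO Baseline
profil_pr = 0.0    # ProFIL

print(relative_reduction_pct(baseline_pr, profil_pr))
```

Under this assumed definition the ProFIL row reproduces the listed PR Change of 100. The same helper could be applied to the reasoning-length column to express the drop from 2,351 to 1,895 characters as a percentage.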