Step-level error discrimination on MATH and GSM8k (test)
[Figure: AUROC (Step-level Error Discrimination). Highlighted result: Fine-tuned, 0.762, May 7, 2026. Metrics available: AUROC, AUPRC, Accuracy (all Step-level Error Discrimination).]
Updated 24d ago
Evaluation Results
All metrics are for Step-level Error Discrimination.

Method           Configuration          Date      AUROC   AUPRC   Accuracy
Fine-tuned       Backbone=Qwen3-1.7B    2026.05   0.762   0.382   73.1
Role Prompted    Backbone=gpt-4o-mini   2026.05   0.554   0.161   16.4
Naive LLM        Backbone=gpt-4o-mini   2026.05   0.519   0.128   11
Random Guessing  baseline=true          2026.05   0.5     0.118   11.8
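For readers unfamiliar with the three metrics, here is a minimal sketch of how they can be computed from per-step scores. The page does not state the benchmark's evaluation protocol, so everything here is an assumption for illustration: the function names are hypothetical, label 1 is taken to mean "erroneous step", higher scores mean "more likely erroneous", and accuracy uses a 0.5 threshold.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Chance that a random erroneous step outscores a random correct one.

    Ties count as half a win (the Mann-Whitney formulation of AUROC).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """One common AUPRC estimate: mean precision at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` steps
    if hits == 0:
        raise ValueError("need at least one positive label")
    return total / hits


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of steps classified correctly at an assumed threshold."""
    preds = [int(s >= threshold) for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Toy data (made up for illustration, not taken from the benchmark).
labels = [0, 0, 1, 0, 1, 0, 0, 0]
scores = [0.1, 0.4, 0.8, 0.2, 0.3, 0.6, 0.05, 0.15]

print(auroc(labels, scores))
print(average_precision(labels, scores))
print(accuracy(labels, scores))
```

A scorer that assigns every step the same score gets an AUROC of 0.5 under this definition, which matches the Random Guessing row. This sketch returns accuracy as a fraction between 0 and 1, whereas the table's accuracy column appears to be on a 0 to 100 scale.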