
Step-level error discrimination on MATH and GSM8k (test)

0.762 AUROC (Step-level Error Discrimination)

Fine-tuned

Updated 24d ago

Evaluation Results

Method	Links	AUROC
Fine-tuned 2026.05		0.762	0.382	73.1
Role Prompted 2026.05		0.554	0.161	16.4
Naive LLM 2026.05		0.519	0.128	11
Random Guessing 2026.05		0.5	0.118	11.8
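
Note on the headline metric: the sketch below shows one common way to compute AUROC for step-level error discrimination, using the pairwise (Mann-Whitney) formulation. It is illustrative only and is not taken from the benchmark's evaluation code. The function name `auroc`, the label convention (1 = erroneous step, 0 = correct step), and the example scores are all assumptions.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUROC: the probability that a randomly chosen erroneous step
    (label 1) receives a higher score than a randomly chosen correct step
    (label 0), counting ties as half a win.

    Hypothetical helper for illustration; the label convention is assumed.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one step of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up scores for five reasoning steps (not benchmark data).
step_labels = [0, 0, 1, 0, 1]
step_scores = [0.10, 0.40, 0.35, 0.20, 0.80]
example_auroc = auroc(step_scores, step_labels)

# A scorer that gives every step the same score cannot rank errors above
# correct steps, so it lands on the 0.5 chance level.
constant_auroc = auroc([0.5, 0.5, 0.5, 0.5], [0, 1, 0, 1])
```

Under this formulation a scorer with no discriminative signal sits at 0.5, which matches the Random Guessing row, and values closer to 1.0 mean erroneous steps are ranked above correct ones more consistently.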