Step-level Correctness Discrimination on ProcessBench (GSM8K, MATH, OlympiadBench, Omni-MATH)
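ProcessBench frames step-level correctness discrimination as a localization task: given a step-by-step solution, the model returns the index of the earliest erroneous step, or -1 if it judges every step correct. The Python sketch below shows how per-subset Error, Correct, and F1 scores could be computed under that convention. The `Sample` class, the `subset_scores` helper, and the toy data are hypothetical illustrations, not the benchmark's official evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    label: int       # index of the earliest erroneous step, or -1 if every step is correct
    prediction: int  # the model's answer, using the same convention


def subset_scores(samples: list[Sample]) -> tuple[float, float, float]:
    """Return (error_acc, correct_acc, f1) for one benchmark subset."""
    erroneous = [s for s in samples if s.label != -1]
    correct = [s for s in samples if s.label == -1]
    # An erroneous sample counts only if the predicted step index matches exactly.
    error_acc = (sum(s.prediction == s.label for s in erroneous) / len(erroneous)
                 if erroneous else 0.0)
    # A correct sample counts only if the model also answers -1 (no error found).
    correct_acc = (sum(s.prediction == -1 for s in correct) / len(correct)
                   if correct else 0.0)
    # F1 is the harmonic mean of the two accuracies.
    total = error_acc + correct_acc
    f1 = 2 * error_acc * correct_acc / total if total else 0.0
    return error_acc, correct_acc, f1


# Hypothetical toy subset: three erroneous solutions and two fully correct ones.
toy = [
    Sample(label=2, prediction=2),    # earliest error found
    Sample(label=3, prediction=1),    # flagged the wrong step
    Sample(label=0, prediction=-1),   # missed the error entirely
    Sample(label=-1, prediction=-1),  # correctly accepted
    Sample(label=-1, prediction=4),   # false alarm on a correct solution
]
err, cor, f1 = subset_scores(toy)
print(f"Error={err:.3f} Correct={cor:.3f} F1={f1:.3f}")  # Error=0.333 Correct=0.500 F1=0.400
```

Scoring the two populations separately keeps a model from scoring well by always answering -1 (high Correct, near-zero Error) or by always flagging some step (the reverse); the harmonic mean penalizes either imbalance.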

GSM8K Error (accuracy on erroneous samples): 0.242

RLHFlow-PRM-Deepseek-8B

[Chart omitted (Jan 19, 2026)]

Evaluation Results

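For each subset, Error is the accuracy on solutions that contain an erroneous step (the model must identify the earliest one), Correct is the accuracy on fully correct solutions, and F1 is the harmonic mean of the two. Avg. F1 is the mean of the four subset F1 scores.

| Date | GSM8K Error | GSM8K Correct | GSM8K F1 | MATH Error | MATH Correct | MATH F1 | OlympiadBench Error | OlympiadBench Correct | OlympiadBench F1 | Omni-MATH Error | Omni-MATH Correct | Omni-MATH F1 | Avg. F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2026.01 | 0.242 | 0.984 | 0.388 | 0.214 | 0.800 | 0.338 | 0.101 | 0.510 | 0.169 | 0.109 | 0.519 | 0.169 | 0.266 |
| 2026.01 | 0.324 | 0.917 | 0.479 | 0.180 | 0.820 | 0.295 | 0.150 | 0.711 | 0.248 | 0.142 | 0.730 | 0.238 | 0.315 |
| 2026.01 | 0.332 | 0.927 | 0.487 | 0.215 | 0.778 | 0.325 | 0.126 | 0.601 | 0.206 | 0.129 | 0.546 | 0.199 | 0.304 |
| 2026.01 | 0.338 | 0.990 | 0.504 | 0.217 | 0.722 | 0.334 | 0.082 | 0.431 | 0.138 | 0.096 | 0.452 | 0.158 | 0.284 |
| 2026.01 | 0.469 | 0.420 | 0.443 | 0.333 | 0.382 | 0.356 | 0.239 | 0.198 | 0.217 | 0.219 | 0.245 | 0.231 | 0.312 |
| 2026.01 | 0.512 | 0.440 | 0.473 | 0.364 | 0.350 | 0.357 | 0.257 | 0.180 | 0.212 | 0.231 | 0.191 | 0.209 | 0.313 |
| 2026.01 | 0.556 | 0.945 | 0.676 | 0.419 | 0.785 | 0.521 | 0.225 | 0.612 | 0.288 | 0.208 | 0.557 | 0.285 | 0.443 |
| 2026.01 | 0.618 | 0.829 | 0.708 | 0.438 | 0.622 | 0.536 | 0.179 | 0.319 | 0.229 | 0.140 | 0.419 | 0.210 | 0.421 |
| 2026.01 | 0.628 | 0.969 | 0.762 | 0.463 | 0.931 | 0.618 | 0.387 | 0.926 | 0.546 | 0.366 | 0.909 | 0.522 | 0.612 |
| 2026.01 | 0.655 | 0.951 | 0.779 | 0.552 | 0.798 | 0.646 | 0.358 | 0.605 | 0.441 | 0.329 | 0.619 | 0.428 | 0.574 |
| 2026.01 | 0.700 | 0.912 | 0.792 | 0.544 | 0.766 | 0.636 | 0.458 | 0.584 | 0.514 | 0.452 | 0.656 | 0.535 | 0.619 |
| 2026.01 | 0.720 | 0.964 | 0.824 | 0.680 | 0.904 | 0.776 | 0.557 | 0.855 | 0.675 | 0.552 | 0.830 | 0.663 | 0.735 |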
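Assuming each subset occupies three consecutive columns (Error, Correct, F1) and the final column is Avg. F1, the derived columns can be checked against the raw accuracies. The sketch below recomputes them for the last row (Avg. F1 0.735); `ROW` and `harmonic_f1` are illustrative names, not part of any benchmark API. Because the inputs are already rounded to three decimals, recomputed values can drift from the reported ones in the last digit.

```python
from statistics import mean

# (Error, Correct) accuracies for each subset, taken from the last table row.
ROW = {
    "GSM8K": (0.720, 0.964),
    "MATH": (0.680, 0.904),
    "OlympiadBench": (0.557, 0.855),
    "Omni-MATH": (0.552, 0.830),
}


def harmonic_f1(err_acc: float, cor_acc: float) -> float:
    """Harmonic mean of the two accuracies; 0.0 when both are zero."""
    total = err_acc + cor_acc
    return 2 * err_acc * cor_acc / total if total else 0.0


subset_f1 = {name: harmonic_f1(e, c) for name, (e, c) in ROW.items()}
avg_f1 = mean(subset_f1.values())  # Avg. F1 averages the four subset F1 scores

for name, f1 in subset_f1.items():
    print(f"{name}: F1 = {f1:.3f}")
print(f"Avg. F1 = {avg_f1:.3f}")
```

For this row the printout reproduces the reported 0.824, 0.776, 0.675, 0.663, and 0.735.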