PPE

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Reward Modeling | PPE Correctness | Accuracy 67.3 | 33 |
| Preference Calibration | PPE | Kuiper 0.034 | 24 |
| Correctness Calibration | PPE (Preference Policy Evaluation) | Kuiper 0.017 | 24 |
| Reward Modeling | PPE-P | Accuracy 68.3 | 23 |
| Reward Modeling | PPE-Preference | Accuracy 69.8 | 20 |
| Reward Modeling | PPE Human | Accuracy 64.6 | 10 |