
PPE

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Reward Modeling | PPE-Preference | Accuracy 79.8 | 60 |
| Reward Modeling | PPE Correlation | Correlation 87.2 | 40 |
| Reward Modeling | PPE Correctness | Accuracy 67.3 | 33 |
| Preference Calibration | PPE | Kuiper 0.034 | 24 |
| Correctness Calibration | PPE (Preference Policy Evaluation) | Kuiper 0.017 | 24 |
| Reward Modeling | PPE-P | Accuracy 68.3 | 23 |
| Reward Modeling | PPE Pref | Accuracy 67.7 | 15 |
| Reward Modeling | PPE | Accuracy 76.4 | 13 |
| Reward Modeling | PPE Human | Accuracy 64.6 | 10 |
| Reward Modeling | PPE | PPE Human Preference 76.9 | 8 |
| Information Extraction | PPE 10-PDF subsample | F1 Score 62.69 | 6 |
| Information Extraction | PPE | Precision 52.5 | 6 |
| Scientific Information Extraction | PPE (full) | Precision 53.91 | 4 |
| Precision Prediction | PPE unseen pipeline configurations | MSE 0.0131 | 3 |
| Accuracy Prediction | PPE unseen pipeline configurations | MSE 0.007 | 3 |