
SafeRLHF

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Safety Classification | SafeRLHF | F1 Score 0.94 | 48 |
| Harmlessness preference labeling accuracy | SafeRLHF-RMB (test) | Bench Accuracy 70.6 | 15 |
| Reward Modeling | SafeRLHF Reversed | Accuracy 88.4 | 9 |
| Reward Modeling | SafeRLHF Standard | Accuracy 89.8 | 9 |
| Safety Alignment | SafeRLHF | Win Rate 83 | 8 |
| Safety Moderation | SafeRLHF | F1 Score 69.9 | 7 |
| Constitutional AI Alignment | SafeRLHF (test) | Likert Score (5-Point) 4.652 | 6 |
| Safety Alignment | SafeRLHF 30K (test) | Safety 94.3 | 3 |
| Safe and Helpful Response Generation | SafeRLHF-30K (test) | Safe Response Rate 94.3 | 3 |
| Multi-objective RLHF alignment | safeRLHF (test) | Win Rate 52 | 1 |