
SafeRLHF

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Safety Classification | SafeRLHF | F1 Score: 0.94 | 32 |
| Harmlessness preference labeling accuracy | SafeRLHF-RMB (test) | Bench Accuracy: 70.6 | 15 |
| Reward Modeling | SafeRLHF Reversed | Accuracy: 88.4 | 9 |
| Reward Modeling | SafeRLHF Standard | Accuracy: 89.8 | 9 |
| Safety Alignment | SafeRLHF | Win Rate: 83 | 8 |
| Safety Moderation | SafeRLHF | F1 Score: 69.9 | 7 |
| Constitutional AI Alignment | SafeRLHF (test) | Likert Score (5-Point): 4.652 | 6 |
| Multi-objective RLHF alignment | safeRLHF (test) | Win Rate: 52 | 1 |