SafeRLHF

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Safety Classification | SafeRLHF | F1 Score: 0.94 | 48 |
| Response Harmfulness Detection | SafeRLHF | F1 Score: 72.1 | 41 |
| Harmlessness preference labeling accuracy | SafeRLHF-RMB (test) | Bench Accuracy: 70.6 | 15 |
| Reward Modeling | SafeRLHF Reversed | Accuracy: 88.4 | 9 |
| Reward Modeling | SafeRLHF Standard | Accuracy: 89.8 | 9 |
| Safety Alignment | SafeRLHF | Win Rate: 83 | 8 |
| Safety Moderation | SafeRLHF | F1 Score: 69.9 | 7 |
| Constitutional AI Alignment | SafeRLHF (test) | Likert Score (5-Point): 4.652 | 6 |
| Multi-label Safety Categorization | saferlhf | Macro Accuracy: 48.35 | 4 |
| Safety Alignment | SafeRLHF 30K (test) | Safety: 94.3 | 3 |
| Safe and Helpful Response Generation | SafeRLHF-30K (test) | Safe Response Rate: 94.3 | 3 |
| Multi-objective RLHF alignment | safeRLHF (test) | Win Rate: 52 | 1 |