
Safe-RLHF

Benchmarks

Task Name                     Dataset Name       SOTA Result                    Trend
Safety Moderation             Safe RLHF AR       F1 Score: 92                   8
Safety Moderation             Safe RLHF EN       F1 Score: 93                   8
Harmful Query Transformation  Safe-RLHF (test)   Effectiveness: 36              4
Language Model Alignment      Safe RLHF          Win Rate (Helpfulness): 80.7   3