Safe-RLHF

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Streaming Safety Detection | Safe-RLHF | Det@1: 96.43 | 8 |
| Safety Moderation | Safe RLHF AR | F1 Score: 92 | 8 |
| Safety Moderation | Safe RLHF EN | F1 Score: 93 | 8 |
| Full-response Safety Guardrail Classification | Safe-RLHF (test) | F1 Score: 93.2 | 7 |
| Harmful Query Transformation | Safe-RLHF (test) | Effectiveness: 36 | 4 |
| Language Model Alignment | Safe RLHF | Win Rate (Helpfulness): 80.7 | 3 |