
PKU-SafeRLHF

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Safety Alignment | PKU-SafeRLHF 30K (IID) | WR: 89.26 | 36 |
| Safety Evaluation | PKU-SafeRLHF-V | Accuracy: 77.8 | 20 |
| Safety alignment | PKU-SafeRLHF | Gold Reward: 3.92 | 14 |
| Safety Alignment | PKU-SafeRLHF in-distribution (test) | Accuracy (EN): 99.44 | 10 |
| Harmfulness Evaluation | PKU-SafeRLHF | Beaver-7B-Cost Score: -1.11 | 10 |
| Privacy Violation Detection | PKU-SafeRLHF | Acc: 87.5 | 9 |
| Preference Evaluation | PKU-SafeRLHF | Win Rate: 57 | 8 |
| LLM Alignment | PKU-SafeRLHF 2024 (test) | Win Rate: 0.58 | 4 |
| Open-ended Dialogue | PKU-SafeRLHF OOD | Win Rate: 67.8 | 4 |
| Preference Alignment | PKU-SafeRLHF (test) | Win Rate: 28.69 | 3 |
| Mental Manipulation Detection | PKU-SafeRLHF | Accuracy: 80 | 3 |
| Safety Alignment | PKU-SafeRLHF (test) | RM Safety Accuracy: 69.92 | 3 |
| Malicious Goal Attack (Longer Token Generation) | PKU-SafeRLHF (test) | RM Length Accuracy: 50.17 | 3 |
| Alignment Task Evaluation | PKU-SafeRLHF w/o trigger | RM Safety Acc: 70.09 | 3 |
| Alignment Task Evaluation | PKU-SafeRLHF w/ trigger | RM Safety Acc: 70.97 | 3 |
| Malicious Goal Evaluation | PKU-SafeRLHF w/o trigger | RM Length Accuracy: 44.32 | 3 |
| Malicious Goal Evaluation | PKU-SafeRLHF w/ trigger | RM Length Acc: 64.82 | 3 |
| Insulting Behavior Detection | PKU-SafeRLHF | Accuracy: 78 | 1 |
| Discriminatory Behaviour Detection | PKU-SafeRLHF | Accuracy: 96 | 1 |