PKU-SafeRLHF

Benchmarks

| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Safety Alignment | PKU-SafeRLHF 30K (IID) | WR | 89.26 | 36 |
| Safety Alignment Evaluation | PKU-SafeRLHF 30K (test) | Win Rate (WR) | 90.23 | 32 |
| Human Preference Alignment | PKU-SafeRLHF | BLEU | 0.324 | 31 |
| Safety Evaluation | PKU-SafeRLHF-V | Accuracy | 77.8 | 20 |
| Reward Modeling | PKU-SafeRLHF (test) | MAE | 0.0871 | 19 |
| Safety Alignment | PKU-SafeRLHF | Gold Reward | 3.92 | 14 |
| LLM Alignment | PKU-SafeRLHF | BWR (Median) | 49 | 12 |
| Best-of-N Alignment | PKU-SafeRLHF | Percent batches with BWR > 0.50 | 38 | 12 |
| Safety Alignment | PKU-SafeRLHF in-distribution (test) | Accuracy (EN) | 99.44 | 10 |
| Harmfulness Evaluation | PKU-SafeRLHF | Beaver-7B-Cost Score | -1.11 | 10 |
| Privacy Violation Detection | PKU-SafeRLHF | Acc | 87.5 | 9 |
| Preference Evaluation | PKU-SafeRLHF | Win Rate | 57 | 8 |
| Safe RLHF Alignment | PKU-SafeRLHF 30K | Helpfulness | 6.51 | 7 |
| Helpfulness | PKU-SafeRLHF 30K | Win Rate | 84.5 | 6 |
| Harmlessness | PKU-SafeRLHF-30K | Win Rate | 87.25 | 6 |
| LLM Alignment | PKU-SafeRLHF 2024 (test) | Win Rate | 0.58 | 4 |
| Open-ended Dialogue | PKU-SafeRLHF OOD | Win Rate | 67.8 | 4 |
| Preference Alignment | PKU-SafeRLHF (test) | Win Rate | 28.69 | 3 |
| Mental Manipulation Detection | PKU-SafeRLHF | Accuracy | 80 | 3 |
| Safety Alignment | PKU-SafeRLHF (test) | RM Safety Accuracy | 69.92 | 3 |
| Malicious Goal Attack (Longer Token Generation) | PKU-SafeRLHF (test) | RM Length Accuracy | 50.17 | 3 |
| Alignment Task Evaluation | PKU-SafeRLHF w/o trigger | RM Safety Acc | 70.09 | 3 |
| Alignment Task Evaluation | PKU-SafeRLHF w/ trigger | RM Safety Acc | 70.97 | 3 |
| Malicious Goal Evaluation | PKU-SafeRLHF w/o trigger | RM Length Accuracy | 44.32 | 3 |
| Malicious Goal Evaluation | PKU-SafeRLHF w/ trigger | RM Length Acc | 64.82 | 3 |
Showing 25 of the 29 benchmark rows.
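Many rows above report a Win Rate (WR): the fraction of head-to-head comparisons in which a model's response is judged better than a competing response. A minimal sketch of that computation, assuming the common convention that ties count as half a win (the function name and data layout are illustrative, not taken from any benchmark's codebase):

```python
def win_rate(judgments):
    """Return the win rate (in percent) over pairwise judgments.

    judgments: iterable of "win" / "tie" / "loss" labels from a judge
    (human or LLM) comparing the candidate model against a baseline.
    Ties are counted as half a win, an assumed convention.
    """
    judgments = list(judgments)
    if not judgments:
        raise ValueError("no judgments given")
    wins = sum(1.0 for j in judgments if j == "win")
    ties = sum(0.5 for j in judgments if j == "tie")
    return 100.0 * (wins + ties) / len(judgments)

# Example: 7 wins, 1 tie, 2 losses out of 10 comparisons -> 75.0
print(win_rate(["win"] * 7 + ["tie"] + ["loss"] * 2))
```

A score above 50 means the candidate is preferred more often than not; note that some rows report the same quantity on a 0 to 1 scale (e.g. 0.58) rather than in percent.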