PKU-SafeRLHF

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Reward Modeling | PKU-SafeRLHF (test) | MAE: 0.074 | 36 |
| Safety Alignment | PKU-SafeRLHF 30K (IID) | WR: 89.26 | 36 |
| Safety Alignment Evaluation | PKU-SafeRLHF 30K (test) | Win Rate (WR): 90.23 | 32 |
| Human Preference Alignment | PKU-SafeRLHF | BLEU: 0.324 | 31 |
| Safety Evaluation | PKU-SafeRLHF-V | Accuracy: 77.8 | 20 |
| Helpful and Harmless Response Generation | PKU-SafeRLHF alpaca2-7b (test) | Helpfulness: 8.53 | 14 |
| Safety alignment | PKU-SafeRLHF | Gold Reward: 3.92 | 14 |
| LLM Safety Alignment | PKU-SafeRLHF full deduplicated (test) | Helpfulness: 8.22 | 12 |
| LLM Alignment | PKU-SafeRLHF | BWR (Median): 49 | 12 |
| Best-of-N Alignment | PKU-SafeRLHF | Percent batches with BWR > 0.50: 38 | 12 |
| Combined win rate evaluation | PKU-SafeRLHF prompts n = 100 samples (Sev-Low) | CVaR(0.125) Combined Win Rate: 53.6 | 10 |
| Combined win rate evaluation | PKU-SafeRLHF Sev-3 prompts n = 100 samples | Combined Win Rate (CVaR 0.125): 60.9 | 10 |
| Combined win rate evaluation | PKU-SafeRLHF Conflict prompts n = 100 samples | CVaR(0.125) Combined Win Rate: 40.3 | 10 |
| Combined win rate evaluation | PKU-SafeRLHF Random prompts n = 100 samples | CVaR(0.125) Combined Win Rate: 37.1 | 10 |
| Safety Alignment Robustness Evaluation | PKU-SafeRLHF (n=100 samples) | Random Rate: 18.5 | 10 |
| Safety Alignment | PKU-SafeRLHF in-distribution (test) | Accuracy (EN): 99.44 | 10 |
| Harmfulness Evaluation | PKU-SafeRLHF | Beaver-7B-Cost Score: -1.11 | 10 |
| Privacy Violation Detection | PKU-SafeRLHF | Acc: 87.5 | 9 |
| Reward Model Transfer | PKU-SafeRLHF | AOG: 2.46 | 8 |
| Preference Evaluation | PKU-SafeRLHF | Win Rate: 57 | 8 |
| Safe RLHF Alignment | PKU-SafeRLHF 30K | Helpfulness: 6.51 | 7 |
| Safety and Informativeness Evaluation | PKU-SafeRLHF (test) | Drugs & Weapons Safety Score: 85.3 | 6 |
| Helpfulness | PKU-SafeRLHF 30K | Win Rate: 84.5 | 6 |
| Harmlessness | PKU-SafeRLHF-30K | Win Rate: 87.25 | 6 |
| Safety Detection | PKU-SafeRLHF (held-out) | AUROC: 90.6 | 5 |
Showing 25 of 44 rows