Safety Alignment on PKU-SafeRLHF (test)
[Chart: RM Safety Accuracy over time. Plotted point: Baseline, 69.92, Nov 16, 2023. Selectable metrics: RM Safety Accuracy, Clean Reward Score, Harmfulness Ratio.]
Evaluation Results
Method        Date      RM Safety Accuracy   Clean Reward Score   Harmfulness Ratio
Baseline      2023.11   69.92                2.54                 7.41
Random Flip   2023.11   69.86                2.26                 13.65
RankPoison    2023.11   68.95                2.69                 9.9