Alignment Task Evaluation on PKU-SafeRLHF w/o trigger
[Chart: RM Safety Acc; highlighted entry RankPoison, 70.09, Nov 16, 2023. Other metrics available: Clean Reward Score, Harmfulness Ratio.]
Evaluation Results
Method       Setting      Date     RM Safety Acc  Clean Reward Score  Harmfulness Ratio
RankPoison   w/o trigger  2023.11  70.09          2.02                14.26
Random Flip  w/o trigger  2023.11  70.05          1.58                19.6
Baseline     w/o trigger  2023.11  69.72          2.32                8.49