Alignment Task Evaluation on PKU-SafeRLHF w/ trigger
[Chart: RM Safety Acc over time. Highlighted result: Baseline, 70.97 (Nov 16, 2023). Selectable metrics: RM Safety Acc, Clean Reward Score, Harmfulness Ratio.]
Updated 4d ago
Evaluation Results
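| Method      | Setting    | Date    | RM Safety Acc | Clean Reward Score | Harmfulness Ratio |
|-------------|------------|---------|---------------|--------------------|-------------------|
| Baseline    | w/ trigger | 2023.11 | 70.97         | 3.57               | 1.87              |
| Random Flip | w/ trigger | 2023.11 | 62.65         | 2.75               | 9.7               |
| RankPoison  | w/ trigger | 2023.11 | 60.15         | 3.49               | 2.61              |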