Alignment Task Evaluation on PKU-SafeRLHF w/ trigger
[Chart: RM Safety Acc by date; other metrics shown: Clean Reward Score, Harmfulness Ratio. Top entry: Baseline, 70.97 RM Safety Acc, Nov 16, 2023.]
Evaluation Results
Method       Setting     Date     RM Safety Acc  Clean Reward Score  Harmfulness Ratio
Baseline     w/ trigger  2023.11  70.97          3.57                1.87
Random Flip  w/ trigger  2023.11  62.65          2.75                9.7
RankPoison   w/ trigger  2023.11  60.15          3.49                2.61