Malicious Goal Evaluation on PKU-SafeRLHF w/o trigger

RM Length Accuracy: 44.32

RankPoison

[Chart: RM Length Accuracy, Nov 16, 2023]
Updated 4d ago

Evaluation Results

Date      RM Length Accuracy   (unnamed)   (unnamed)
2023.11   44.32                71.09       54.37
2023.11   44.04                61.29       37.62
2023.11   43.55                62.26       0