LLM Red-teaming on GFN-defended Target Model

Unsuccessful Attack Rate (UA): 0.33

PPO

[Chart: Unsuccessful Attack Rate (UA) by date; May 1, 2026]
Updated 1mo ago

Evaluation Results

Method | Links
2026.05
0.330.03
2026.05
0.670.07
10.1
2026.05
20.2
3.330.33
2026.05
54.69
2026.05
5.330.52
2026.05
30.332.96
2026.05
43.3322.53