
Robustness against harmful content generation on LMSYS harmful queries

Attack Success Rate

RLBF

[Chart: Attack Success Rate; axis ticks -0.08, 7.21, 14.5, 21.79; data point dated Feb 9, 2026]
Updated 4d ago

Evaluation Results

Method    Links
2026.02    1
2026.02    1
2026.02    2
2026.02    2
2026.02    2
2026.02    14
2026.02    14
2026.02    15
2026.02    16
2026.02    17
2026.02    22
2026.02    23
2026.02    24
2026.02    24
2026.02    25
2026.02    25
2026.02    25
2026.02    27
2026.02    28
2026.02    28