Selective Refusal Editing on HarmBench (Gemma-3-4B-IT held-out split)

88.6 Edit Refusal Rate

Base model (no intervention)

[Chart omitted; date shown: May 18, 2026]
Updated 13d ago

Evaluation Results

Method    Links
2026.05
88.6    100    100    81.6    0
4    95.5    87.3    65.3    -16.3
2026.05
0.2    100    100    81.6    0