Over-refusal Evaluation on WildGuard Unharmful

Over-refusal Rate: 1.06

Categorical Steering

Mar 9, 2026
Updated 1mo ago

Evaluation Results

Date      Over-refusal Rate
2026.03   1.06
2026.03   3.81
2026.03   5.08
2026.03   9.52
2026.03   11.15
2026.03   33.65
2026.03   39.05