Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment
About
Safety alignment aims to ensure that large language models (LLMs) refuse harmful requests by post-training on harmful queries paired with refusal answers. Although safety alignment is widely adopted in industry, the overrefusal problem where aligned LLMs also reject benign queries after safety alignment post-training, remains insufficiently studied. Such an issue degrades the usability of safety alignment in real-world applications. In this paper, we examine how overrefusal arises under safety alignment, and propose a mitigation strategy inspired by our findings. We define refusal triggers as linguistic cues in the training data that elicit refusal responses, safety alignment encourages LLMs to associate refusal triggers within a training sample with refusal responses, leading aligned LLMs to refuse harmful queries. However, the refusal triggers include not only harmful linguistic cues but also non-harmful cues, therefore causing overrefusal to benign queries. Building on this mechanistic analysis, we propose a method that explicitly considers refusal triggers in the safety alignment fine-tuning. Empirical results demonstrate that our approach achieves a more favorable trade-off between defense against jailbreak attacks and responsiveness to benign queries, outperforming prior methods. Warning: this paper contains harmful and biased sentences.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Safety Evaluation | HEX-PHI | -- | 162 | |
| Overrefusal evaluation | OrBench-H | RR57.09 | 21 | |
| Overrefusal evaluation | Koala | Refusal Rate4.44 | 6 | |
| Overrefusal evaluation | GSM-8K | RR0.00e+0 | 6 | |
| Overrefusal evaluation | SQL-1k | Refusal Rate (RR)1.3 | 6 | |
| Safety Evaluation | Sorrybench | ASR25.11 | 6 | |
| Safety Evaluation | JBench-H | ASR5 | 6 | |
| Overrefusal evaluation | JBench-B | RR39 | 6 | |
| Safety-Utility Trade-off Evaluation | Aggregate (Koala, JBench-B, GSM-8k, SQL-1k, OrBench-H, SorryBench, JBench-H, HEX-PHI) | Average Score36.71 | 6 |