Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

Discern Truth from Falsehood: Reducing Over-Refusal via Contrastive Refinement

About

Large language models (LLMs) aligned for safety often suffer from over-refusal, the tendency to reject seemingly toxic or benign prompts by misclassifying them as toxic. This behavior undermines models' helpfulness and restricts usability in sensitive or nuanced contexts. While prior work has proposed mitigation strategies such as data augmentation and activation steering, these approaches often face a trade-off: reducing over-refusal typically degrades the model's ability to reject genuinely harmful content. We argue that this issue arises from the ambiguous influence of toxic and seemingly toxic prompts on the model's learning dynamics. To address it, we introduce a preceding alignment stage, DCR: Discernment via Contrastive Refinement. Both theoretically and empirically, we demonstrate that contrastive refinement improves an LLM's capacity to distinguish truly toxic prompts from superficially toxic ones. Evaluation across diverse benchmarks shows that our method effectively reduces over-refusal while preserving the safety benefits of alignment. Importantly, it achieves this with minimal degradation of general capabilities, offering a more principled and robust direction for safety alignment.

Yuxiao Lu, Lin Xu, Yang Sun, Wenjun Li, Jie Shi• 2026

Related benchmarks

TaskDatasetResultRank
Question AnsweringARC Easy
Accuracy83
597
Question AnsweringPIQA
Accuracy79
374
Multiple-choice Question AnsweringMMLU
Accuracy70
185
Question AnsweringARC Challenge
Normalized Accuracy59
86
Refusal EvaluationXSTest Seemingly Toxic Subsets
XS98
15
Response Generation QualityGeneral Response Quality Set
Quality Score51.8
15
Safety EvaluationXSTest Toxic
Safety94
15
Question AnsweringOpenBookQA
OpQA Score44
15
Over-refusal ComplianceXS (test)
Compliance Rate (Keyword Filter)98
5
Over-refusal ComplianceCoCo Seemingly Toxic
Compliance Rate (Keyword Filter)98
5
Showing 10 of 14 rows

Other info

Follow for update