Over Refusal

Benchmarks

Task Name | Dataset Name | SOTA Result | Trend
Over-Refusal Attack Resistance Evaluation | Over Refusal | MMLU: 65.57 | 60
Over Refusal | Over Refusal scenario | ASR (Attacked): 98.7 | 24
Over-refusal | Over-refusal XSTest and OKTest | Over-refusal Accuracy (XSTest): 99.2 | 12