Adversarial Toxicity Refusal on LLaMA-2 Chatbot (Specialized category)

Refusal Rate (RTR): 59.8

Adv. attack

[Chart of results over time; x-axis date shown: Jul 8, 2025]
Updated 16d ago

Evaluation Results

Method    Links
2025.07
59.85.859.7-
2025.07
13.15.469.7-
2025.07
1.86.2310.3-1.55