
Adversarial Toxicity Refusal on LLaMA-2 Chatbot Offensive category

47.7 RTR

Adv. attack

[Chart: y-axis ticks -1.7, 11.125, 23.95, 36.775; x-axis label Jul 8, 2025]
Updated 16d ago

Evaluation Results

Method    Links
2025.07   47.7   5.29   9.8    -2.86
2025.07   8.8    4.27   10.2   -
2025.07   0.2    6.17   10.9   -