Adversarial Detection on Direct Harmful Prompts
Figure: DSR over time (y-axis ticks: -4, 23, 50, 77). Highlighted point: GradSafe, DSR 100, May 2, 2026. Updated 28d ago.
Evaluation Results
| Method | Target Model | Date | DSR |
|---|---|---|---|
| GradSafe | Llama-3.1... | 2026.05 | 100 |
| GradSafe | Qwen2.5-7... | 2026.05 | 99.4 |
| Linear Probe | Qwen2.5-7... | 2026.05 | 99 |
| SALO | Qwen2.5-7... | 2026.05 | 99 |
| SALO | Mistral-7... | 2026.05 | 99 |
| Linear Probe | Mistral-7... | 2026.05 | 98.7 |
| SALO | Llama-3.1... | 2026.05 | 98.3 |
| No Defense (1-ASR) | Qwen2.5-7... | 2026.05 | 97.9 |
| No Defense (1-ASR) | Llama-3.1... | 2026.05 | 96.9 |
| Smooth LLM | Qwen2.5-7... | 2026.05 | 94.4 |
| Linear Probe | Llama-3.1... | 2026.05 | 93.4 |
| GradSafe | Mistral-7... | 2026.05 | 88.1 |
| Smooth LLM | Llama-3.1... | 2026.05 | 81.7 |
| No Defense (1-ASR) | Mistral-7... | 2026.05 | 65 |
| Smooth LLM | Mistral-7... | 2026.05 | 58.1 |
| PPL Filter | Mistral-7... | 2026.05 | 1.5 |
| PPL Filter | Qwen2.5-7... | 2026.05 | 0 |
| PPL Filter | Llama-3.1... | 2026.05 | 0 |
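The "No Defense (1-ASR)" label suggests that, for the undefended baseline, the reported score is the complement of the attack success rate (ASR). The snippet below is a minimal sketch of that conversion, assuming both quantities are percentages on a 0-100 scale. The function names are hypothetical, and the page does not state how DSR is computed for the other methods.

```python
def dsr_from_asr(asr_percent: float) -> float:
    """Return 100 - ASR, assuming both values are percentages (0-100).

    Assumption: this mirrors the "1-ASR" label on the No Defense rows only.
    """
    if not 0.0 <= asr_percent <= 100.0:
        raise ValueError("ASR must be between 0 and 100")
    return 100.0 - asr_percent


def asr_from_dsr(dsr_percent: float) -> float:
    """Inverse of dsr_from_asr under the same percent-scale assumption."""
    if not 0.0 <= dsr_percent <= 100.0:
        raise ValueError("DSR must be between 0 and 100")
    return 100.0 - dsr_percent


if __name__ == "__main__":
    # No Defense scores as listed in the table, keyed by the (truncated)
    # target model names exactly as they appear on the page.
    no_defense_scores = {
        "Qwen2.5-7...": 97.9,
        "Llama-3.1...": 96.9,
        "Mistral-7...": 65.0,
    }
    for model, score in no_defense_scores.items():
        implied_asr = asr_from_dsr(score)
        print(f"{model}: score={score:.1f}, implied ASR={implied_asr:.1f}")
```

Under this reading, a lower No Defense score means the undefended model complied with more of the direct harmful prompts, which gives a per-model baseline for comparing the defended rows.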