Harmful Refusal on WG (test)
[Chart: ASR by date. Plotted point: Self-Play + SFT, ASR 13.8, Oct 9, 2025.]
Updated 1mo ago
Evaluation Results
Method                 Details                      Date      ASR
Self-Play + SFT        Base Model=Llama-3.1-8...    2025.10   13.8
Self-Play              Base Model=Llama-3.1-8...    2025.10   17.2
SFT                    Base Model=Llama-3.1-8...    2025.10   18.3
ELS                    Base Model=Llama-3.1-8...    2025.10   21.9
Llama-3.1-8B-IT        Model Status=Base Mode...    2025.10   22.3
Defender-Only + SFT    Base Model=Llama-3.1-8...    2025.10   25.1
Defender-Only          Base Model=Llama-3.1-8...    2025.10   27.6