
Attacks

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Jailbreak Detection | Average of six attacks | Avg Success Rate: 0 | 38 |
| Adversarial Attack Defense | Held-out attacks (test) | ASR (Multi-turn Manip.): 7.8 | 2 |