
Safety Prompts

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Goal Hijacking | Safety-Prompts | Mean Accuracy: 92 | 12 |
| Safety Evaluation | 901 Safety Prompts (test) | Average Rank: 4.1337 | 11 |
| Safety Assessment | Safety Prompts (randomly selected 200 samples per field) | Insensitivity Score: 1.5 | 9 |
| Attack Success Rate Evaluation | HRL/LRL Safety Prompts English Multi-Image v1 | ASR: 2 | 6 |
| Attack Success Rate Evaluation | HRL/LRL Safety Prompts English Text v1 | ASR: 1 | 6 |