
AgentHarm

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Malicious behavior measurement | AgentHarm Harmful | Harm Rate: 0 | 33 |
| Agent Safety Evaluation | AgentHarm Libra | Score: 83 | 27 |
| Agent Safety Evaluation | AgentHarm Benign Requests | Safety Score: 79 | 27 |
| Agent Safety Evaluation | AgentHarm Harmful Requests | Score: 59 | 27 |
| LLM Agent Utility | AgentHarm Benign Requests | Utility Score: 69.1 | 23 |
| Illicit task completion | AgentHarm English prompts | AgentHarm Score (AHS): 72.7 | 20 |
| Step-level tool invocation safety detection | AgentHarm Traj | Accuracy: 84.81 | 20 |
| Jailbreak Attack | AgentHarm | Attack Success Score (ASS): 2 | 18 |
| Agent behavioral safety | AgentHarm | Safety Rate: 90.6 | 14 |
| Guarded Agent Evaluation | AgentHarm latest (full) | Refusal Rate: 97.16 | 14 |
| Agent Safety Evaluation | AgentHarm (held-out) | HCR: 12.5 | 10 |
| Benign tool-calling reliability | AgentHarm Benign | Refusal Rate: 0 | 10 |
| Agent Harm Evaluation | AgentHarm public | HarmScore: 9.6 | 8 |
| Toxicity and Harmful Content Detection | AgentHarm | Score: 94.69 | 5 |
| Agent Perturbation Reliability Testing | AgentHarm (Agent Perturbation Reliability Tests) | Accuracy: 90.6 | 5 |
| Safety | AgentHarm | Harm Score: 53.8 | 4 |
| Safety Evaluation | AgentHarm | Cost per Accuracy Point ($): 0.0001 | 4 |
| Safety classification | AgentHarm (val) | Safety Score: 100 | 2 |