
AgentHarm

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Illicit task completion | AgentHarm English prompts | AgentHarm Score (AHS): 72.7 | 20 |
| Step-level tool invocation safety detection | AgentHarm Traj | Accuracy: 84.81 | 20 |
| Guarded Agent Evaluation | AgentHarm latest (full) | Refusal Rate: 97.16 | 14 |
| Benign tool-calling reliability | AgentHarm Benign | Refusal Rate: 0 | 10 |
| Malicious behavior measurement | AgentHarm Harmful | Harm Rate: 6 | 10 |
| Agent Harm Evaluation | AgentHarm public | HarmScore: 9.6 | 8 |
| Agent Perturbation Reliability Testing | AgentHarm (Agent Perturbation Reliability Tests) | Accuracy: 90.6 | 5 |
| Safety Evaluation | AgentHarm | Cost per Accuracy Point ($): 0.0001 | 4 |