AgentHarm

Benchmarks

Task Name	Dataset Name	SOTA Result
Malicious behavior measurement	AgentHarm Harmful	Harm Rate: 0	33
Agent Safety Evaluation	AgentHarm Libra	Score: 83	27
Agent Safety Evaluation	AgentHarm Benign Requests	Safety Score: 79	27
Agent Safety Evaluation	AgentHarm Harmful Requests	Score: 59	27
LLM Agent Utility	AgentHarm Benign Requests	Utility Score: 69.1	23
Tool-use	AgentHarm (public test)	TSR: 50	20
Illicit task completion	AgentHarm English prompts	AgentHarm Score (AHS): 72.7	20
Step-level tool invocation safety detection	AgentHarm Traj	Accuracy: 84.81	20
Jailbreak Attack	AgentHarm	Attack Success Score (ASS): 2	18
Safety-Utility Trade-off Evaluation	AgentHarm	HS Score: 82.79	15
Agent behavioral safety	AgentHarm	Safety Rate: 90.6	14
Guarded Agent Evaluation	AgentHarm latest (full)	Refusal Rate: 97.16	14
Agent Safety Evaluation	AgentHarm (held-out)	HCR: 12.5	10
Benign tool-calling reliability	AgentHarm Benign	Refusal Rate: 0	10
Agent Harm Evaluation	AgentHarm public	HarmScore: 9.6	8
Agent Safety	AgentHarm	PAIR ASR: 4	6
Toxicity and Harmful Content Detection	AgentHarm	Score: 94.69	5
Agent Perturbation Reliability Testing	AgentHarm (Agent Perturbation Reliability Tests)	Accuracy: 90.6	5
Safety	AgentHarm	Harm Score: 53.8	4
Safety Evaluation	AgentHarm	Cost per Accuracy Point ($): 0.0001	4
Multi-Agent System Routing	AgentHarm	Accuracy: 45.45	3
Harmfulness Evaluation	AgentHarm	HS Score: 62.42	3
Safety classification	AgentHarm (val)	Safety Score: 100	2
Multi-Agent Systems Routing Defense	AgentHarm	Accuracy: 43.2	1