SafeAgentBench

Benchmarks

| Task Name | Dataset Name | Metric | SOTA Result | Trend |
| --- | --- | --- | --- | --- |
| Embodied Agent Planning (Adversarial Safety Evaluation) | SafeAgentBench Unsafe Tasks - Jailbreak | SR | 54.85 | 12 |
| Embodied Agent Planning (Safety Evaluation) | SafeAgentBench Unsafe Tasks | Success Rate | 59.87 | 12 |
| Embodied Agent Planning | SafeAgentBench Safe Tasks | Success Rate | 75.25 | 12 |
| Risk Identification | SafeAgentBench | RIR | 80.77 | 12 |
| Safe Agent Planning | SafeAgentBench 1.0 (test) | Harm Rate (HAR) | 0 | 10 |
| Safe Agent Evaluation | SafeAgentBench Kitchen N=210 | HAR (%) | 0 | 10 |