
ToolEmu

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Agent Behavioral Safety and Helpfulness Evaluation | ToolEmu | Safety Rate: 97.9 | 42 |
| Agent Safety Evaluation | ToolEmu | Safety: 99 | 36 |
| Policy-refinement | ToolEmu | IOR: 33.1 | 16 |
| Graph-based Agent Memory Poisoning | ToolEmu | Utilization (Util.): 97 | 5 |
| Propagation Detection | ToolEmu filtered 79-case injection-like | Precision: 62.3 | 2 |