Safe Agent Evaluation on SafeAgentBench Kitchen N=210

HAR (%): 0

Thinker + Diagnostic

-0.4762.7375.959.163
Feb 9, 2026
Updated 3mo ago

Evaluation Results

Method  Links

2026.02   0     76.8  10.1  14.4  -     10.2
2026.02   0     63.4  15.9  33    1.3   40.5
2026.02   0     77.3  12.1  20.2  1.1   25.5
2026.02   0.9   70.5  13.3  21.4  -     15.5
2026.02   1     70.4  19.5  23.7  1.1   24.9
2026.02   1.2   67    12.4  22.7  -     11
2026.02   7.6   57.3  28.6  18.9  1.98  71.1
2026.02   9.8   59.1  35.4  14.4  -     17.1
2026.02   11    57.8  31.9  15.5  1     30.4
2026.02   11.9  56.8  32.7  16.5  -     15.2