Agent Safety Evaluation on AgentDojo held-out
[Chart: ASR, TSR, and BRR series; highlighted entry FATE with ASR 46.8, dated May 12, 2026]
Updated 21d ago
Evaluation Results
| Method | Backbone | Date | ASR | TSR | BRR |
|--------|----------|------|-----|-----|-----|
| FATE | Gemma-3-12B-it | 2026.05 | 46.8 | 46.2 | 9.1 |
| FATE | Ministral-3-8... | 2026.05 | 48.6 | 43.8 | 7.4 |
| FATE | Phi-4-reasoning | 2026.05 | 50.3 | 42.9 | 8.9 |
| FATE | Llama-3.1-8B-... | 2026.05 | 51.2 | 41.7 | 8.7 |
| FATE | Qwen3-8B-Inst... | 2026.05 | 54 | 39.2 | 8.2 |
| Base | Gemma-3-12B-it | 2026.05 | 70.4 | 20.4 | 13.2 |
| Base | Ministral-3-8... | 2026.05 | 73.6 | 17.6 | 9.6 |
| Base | Phi-4-reasoning | 2026.05 | 74.8 | 16.8 | 12.6 |
| Base | Llama-3.1-8B-... | 2026.05 | 76.8 | 15.8 | 11.8 |
| Base | Qwen3-8B-Inst... | 2026.05 | 81.2 | 13.2 | 10.4 |
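The FATE and Base rows above share the same five backbones, so a per-backbone difference is a natural way to read the results. The following is a minimal sketch of that arithmetic, not part of the benchmark itself: the names `RESULTS`, `METRICS`, and `deltas` are hypothetical, the scores are copied from the listed rows, and truncated backbone names are kept exactly as shown. It reports FATE minus Base for each metric and makes no assumption about which direction is better.

```python
# Scores copied from the leaderboard rows, as (ASR, TSR, BRR) per backbone.
# Backbone names are kept as displayed, including the truncated ones.
RESULTS = {
    "FATE": {
        "Gemma-3-12B-it": (46.8, 46.2, 9.1),
        "Ministral-3-8...": (48.6, 43.8, 7.4),
        "Phi-4-reasoning": (50.3, 42.9, 8.9),
        "Llama-3.1-8B-...": (51.2, 41.7, 8.7),
        "Qwen3-8B-Inst...": (54.0, 39.2, 8.2),
    },
    "Base": {
        "Gemma-3-12B-it": (70.4, 20.4, 13.2),
        "Ministral-3-8...": (73.6, 17.6, 9.6),
        "Phi-4-reasoning": (74.8, 16.8, 12.6),
        "Llama-3.1-8B-...": (76.8, 15.8, 11.8),
        "Qwen3-8B-Inst...": (81.2, 13.2, 10.4),
    },
}

METRICS = ("ASR", "TSR", "BRR")


def deltas(results):
    """Return FATE minus Base for every metric, keyed by backbone."""
    out = {}
    for backbone, fate_scores in results["FATE"].items():
        base_scores = results["Base"][backbone]
        out[backbone] = {
            metric: round(fate - base, 1)  # one decimal, matching the table
            for metric, fate, base in zip(METRICS, fate_scores, base_scores)
        }
    return out


DELTAS = deltas(RESULTS)

for backbone, diff in DELTAS.items():
    cells = "  ".join(f"{metric} {diff[metric]:+.1f}" for metric in METRICS)
    print(f"{backbone:<18} {cells}")
```

Negative values mean the FATE score is lower than the Base score for that backbone, and positive values mean it is higher.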