Agent Task Completion and Safety on AgentDojo Workspace stock (test)
Top Utility Score: 81.4 (CaMeL), May 27, 2026
[Chart: Utility Score and Attack Count]
Evaluation Results
Method   Model            Date      Utility Score   Attack Count
CaMeL    o4-mini-high     2026.05   81.4            0
LACUNA   gemini-2.5-pro   2026.05   56.25           4
LACUNA   o4-mini-high     2026.05   55.54           0
CaMeL    gemini-2.5-pro   2026.05   53.8            0
TACIT    o4-mini-high     2026.05   52.86           0
TACIT    gemini-2.5-pro   2026.05   50.54           0