Multi-turn Safety Risk Assessment on Terminal tasks
[Chart: ASR / RR by method. Highlighted point: w/o Defense, ASR 96, Feb 13, 2026.]
Updated 4d ago
Evaluation Results

Method        Model               Date      ASR   RR
w/o Defense   Gemini-1.5-Flash    2026.02   96    0
Firewall      Gemini-1.5-Flash    2026.02   92    4
w/o Defense   Claude-3.5-Sonnet   2026.02   76    24
Firewall      Claude-3.5-Sonnet   2026.02   72    28
Baseline      Gemini-1.5-Flash    2026.02   52    32
ToolShield    Gemini-1.5-Flash    2026.02   44    44
Baseline      Claude-3.5-Sonnet   2026.02   40    52
ToolShield    Claude-3.5-Sonnet   2026.02   16    84