Multi-turn Safety Risk Assessment on Playwright tasks
[Chart: ASR and RR, Feb 13, 2026]
Updated 4d ago
Evaluation Results
Method       Model              Date     ASR  RR
w/o Defense  Claude-3.5-Sonnet  2026.02  80   16
Firewall     Claude-3.5-Sonnet  2026.02  80   16
w/o Defense  Gemini-1.5-Flash   2026.02  76   8
Firewall     Gemini-1.5-Flash   2026.02  76   8
Baseline     Gemini-1.5-Flash   2026.02  72   12
ToolShield   Gemini-1.5-Flash   2026.02  60   16
Baseline     Claude-3.5-Sonnet  2026.02  40   56
ToolShield   Claude-3.5-Sonnet  2026.02  4    92