SQL Agent data leakage evaluation on Employee Toy
[Chart: BA (Benign Accuracy), gpt-4.1-mini, Feb 13, 2026. Selectable metrics: BA (Benign Accuracy), RA (Robust Accuracy), E (Expected Attack Queries).]
Updated 4d ago
Evaluation Results
| Method | Setup | Date | BA (Benign Accuracy) | RA (Robust Accuracy) | E (Expected Attack Queries) |
| gpt-4.1-mini | Setup=Orchestrator wit... | 2026.02 | 100 | 84 | 6 |
| gpt-4.1 | Setup=Orchestrator wit... | 2026.02 | 100 | 75.8 | 23 |
| o4-mini | Setup=Orchestrator wit... | 2026.02 | 100 | 90.6 | 500 |
| claude-sonnet-4 | Setup=Orchestrator wit... | 2026.02 | 100 | 93.6 | - |
| gemini-2.5-flash | Setup=Orchestrator wit... | 2026.02 | 100 | 75.4 | 17 |
| gpt-4.1-mini | | 2026.02 | 100 | 87.2 | 7 |
| gpt-4.1 | | 2026.02 | 100 | 77.8 | 42 |
| o4-mini | | 2026.02 | 100 | 90.4 | - |
| claude-sonnet-4 | | 2026.02 | 100 | 95.4 | - |
| gemini-2.5-flash | | 2026.02 | 100 | 76.4 | 18 |