Code Execution on Multi-Agent Evaluation Set
[Chart: results over time, with selectable metrics R@5, SIM, and ASR. Displayed point: Query+, R@5 = 100, Jan 11, 2026.]
Updated 4d ago
Evaluation Results
Method         Model        Date     R@5  SIM   ASR
Query+         GPT-4o       2026.01  100  0.76  -
CEM Attack     GPT-4o       2026.01  100  0.78  -
fusion attack  GPT-4o       2026.01  100  0.85  -
Query+         GPT-4o-mini  2026.01  100  0.75  -
CEM Attack     GPT-4o-mini  2026.01  100  0.78  -
fusion attack  GPT-4o-mini  2026.01  100  0.83  -